ChatGPT, OpenAI’s powerful natural language AI tool, is becoming an integral part of the daily workflow for thousands of developers, writers, and students. The chatbot has been deployed in a wide variety of ways since its release at the end of November last year, and the high standard of its output has led to calls for ways to identify text written using the tool. A metadata ‘watermark’ could be one way to reduce the risk of mass misinformation campaigns, plagiarism, and the tool’s use in phishing attacks, engineers believe.
As with photographs, which can embed details of the photographer, or images generated by OpenAI’s DALL-E 2 tool, the metadata would contain a marker flagging whether the text was written by AI and, if so, how much came from automation.
The use of AI in writing essays has been dubbed “AIgiarism” – AI-assisted plagiarism – and when detected is being treated as seriously as conventional plagiarism. OpenAI’s own terms require users to identify where AI is used to generate content when publishing, but though many suspected cases of AI-generated essays have been pinpointed by academics, proving that a machine has written a piece is difficult.
Speaking at the University of Texas, OpenAI researcher Scott Aaronson said his team was looking at a number of solutions, including tweaking the choice of words selected by ChatGPT in a way that wouldn’t be noticeable to the reader but could be picked up by a tool looking for signs of generated text.
Summarising his lecture in a blog post, Aaronson said the aim is to make it much harder to take a ChatGPT output and pass it off as if it came from a human, explaining that these signatures could be incorporated into plagiarism checkers already used by universities.
He said this would also be used for spotting the “mass generation of propaganda – you know, spamming every blog with seemingly on-topic comments supporting Russia’s invasion of Ukraine without even a building full of trolls in Moscow. Or impersonating someone’s writing style in order to incriminate them.”
Prototype watermarking tool for ChatGPT
OpenAI has a working prototype of the watermarking scheme that “seems to work pretty well”, according to Aaronson. His blog suggests that a few hundred tokens – roughly a paragraph of text – is enough to give a reasonable signal that the text came from GPT-3, the large language model on which ChatGPT is built.
Aaronson said in his blog post that GPT isn’t the gold standard of essay writing yet, but that it is improving with each new version. The biggest issue it faces is the number of plausible-looking but wrong answers it generates. But this is improving, as is its ability to generate real citations.
“If you turned in GPT’s essays I think they’d get at least a B in most courses,” he wrote, “not that I endorse any of you do that, but yes we are about to enter a world where students everywhere will at least be sorely tempted to use text models to write their term papers.”
He said GPT-3 is already being used to write advertising copy, press releases and even full novels at the “lower end of the book market” – the kind of formulaic genre fiction where you can say “give me a few paragraphs of description of this kind of scene” and have it do so. This will only increase as the models improve.
“GPT, I think, is already a pretty good poet. DALL-E is already a pretty good artist,” Aaronson said. “They’re still struggling with some high school and college-level math but they’re getting there. It’s easy to imagine that maybe in five years, people like me will be using these things as research assistants – at the very least, to prove the lemmas in our papers. That seems extremely plausible.”
Aaronson’s main work at OpenAI is ensuring that teachers and others have a way to recognise when copy comes from GPT, by statistically watermarking outputs so they can easily be identified when used “in the wild”. “When you think about the nefarious uses for GPT, most of them require somehow concealing GPT’s involvement, in which case, watermarking would simultaneously attack most misuses,” he wrote.
Manipulating a ‘string of tokens’ in GPT
So far the best way they’ve found to achieve this is by manipulating the “string of tokens” that makes up every input and output in the GPT models. Tokens represent words, punctuation marks and parts of words; there are around 100,000 of them, and GPT is constantly generating a probability distribution over which token to produce next.
It then samples a token from that distribution, subject to a parameter called “temperature”: if the temperature is not zero, there is some randomness in the choice of the next token, which is why you can run the same prompt over and over and get a different output each time.
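The sampling step described above can be sketched in a few lines. This is a generic illustration of temperature sampling over toy scores, not GPT’s actual implementation; the logits values are made up for the example.

```python
import numpy as np

def sample_token(logits, temperature=1.0, rng=None):
    """Sample a token id from raw model scores (logits).

    Temperature 0 always picks the single most likely token; higher
    temperatures flatten the distribution and add randomness.
    """
    if rng is None:
        rng = np.random.default_rng()
    logits = np.asarray(logits, dtype=float)
    if temperature == 0:
        return int(np.argmax(logits))          # deterministic greedy choice
    scaled = logits / temperature              # temperature rescales the logits
    probs = np.exp(scaled - scaled.max())      # softmax, shifted for stability
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

# The same prompt (here, the same logits) gives different tokens when
# temperature is above zero, but always the same token at temperature 0:
logits = [2.0, 1.0, 0.5, 0.1]
print(sample_token(logits, temperature=0))     # always 0
print(sample_token(logits, temperature=1.0))   # varies run to run
```

Running the last line repeatedly mirrors what the article describes: identical prompts producing different completions.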
To watermark the output, Aaronson and his team worked on selecting the next token pseudorandomly using a cryptographic function, rather than truly at random. The key is known only to OpenAI and won’t be detectable by the end user. The pseudorandom function yields a score – a sum of a per-token value across a sequence of tokens – and anyone who knows the key can compute that score and determine whether the text was likely generated using GPT.
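One way such a keyed pseudorandom choice could work is sketched below. The key, the use of HMAC-SHA256 as the cryptographic function, and the exponential-sampling trick for biasing the choice are all assumptions for illustration, not details of OpenAI’s actual scheme.

```python
import hashlib
import hmac

SECRET_KEY = b"known-only-to-the-provider"    # hypothetical key for the sketch

def keyed_value(context_tokens, candidate, key=SECRET_KEY):
    """Map (recent context, candidate token) to a pseudorandom float in (0, 1)."""
    msg = " ".join(map(str, context_tokens)) + "|" + str(candidate)
    digest = hmac.new(key, msg.encode(), hashlib.sha256).digest()
    return (int.from_bytes(digest[:8], "big") + 1) / (2**64 + 1)

def watermarked_choice(probs, context_tokens, key=SECRET_KEY):
    """Pick the token i that maximises r_i ** (1 / p_i).

    This still selects each token with probability close to p_i overall,
    so the text reads normally, but tokens whose keyed value r_i is high
    get chosen more often than chance -- a signal only the key holder
    can check for.
    """
    best, best_score = None, -1.0
    for i, p in enumerate(probs):
        if p <= 0:
            continue
        r = keyed_value(context_tokens, i, key)
        score = r ** (1.0 / p)
        if score > best_score:
            best, best_score = i, score
    return best
```

Because the choice is a deterministic function of the key and the context, the same prompt context always yields the same token under this sketch, while anyone without the key sees output indistinguishable from ordinary sampling.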
The other approach, which he says may be needed in future high-stakes use cases, is to store every output in a large database that can be queried when there is a dispute, but this comes with a “serious privacy concern”, including how you prove a piece of text wasn’t GPT-generated without revealing how people use the AI.
A tool using the watermarking function has been built by OpenAI engineer Hendrik Kirchner and can detect whether something is GPT-generated as long as it is at least a few hundred tokens long – roughly a paragraph of text. “In principle, you could even take a long text and isolate which parts probably came from GPT and which parts probably didn’t,” said Aaronson.
The watermarking signal also survives if the user rearranges the order of sentences or deletes a few words, because it depends on a calculation over the tokens of the full text rather than on individual sentences, making it “robust against those sorts of interventions.”
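A detector along the lines the article describes could recompute the keyed per-token values and sum them over the whole text. Everything here – the key, the HMAC-based keyed function, the context window, and the particular statistic – is an illustrative assumption, not OpenAI’s detector.

```python
import hashlib
import hmac
import math

SECRET_KEY = b"known-only-to-the-provider"    # hypothetical key for the sketch

def keyed_value(context_tokens, token, key=SECRET_KEY):
    """Same keyed pseudorandom map the generator would use, in (0, 1)."""
    msg = " ".join(map(str, context_tokens)) + "|" + str(token)
    digest = hmac.new(key, msg.encode(), hashlib.sha256).digest()
    return (int.from_bytes(digest[:8], "big") + 1) / (2**64 + 1)

def watermark_score(tokens, key=SECRET_KEY, window=4):
    """Average of -ln(1 - r_i) over a text's tokens.

    For ordinary text this averages about 1.0, because the r_i behave
    like uniform random numbers; text whose tokens were steered toward
    high r_i scores noticeably higher.  Summing over the full text is
    what lets the signal survive a few deleted words or reordered
    sentences.
    """
    total = 0.0
    for i, tok in enumerate(tokens):
        context = tokens[max(0, i - window):i]
        r = keyed_value(context, tok, key)
        total += -math.log(1.0 - r)
    return total / max(1, len(tokens))
```

Only someone holding the key can compute the score, which is why, as Aaronson notes, the watermark is invisible to the end user but checkable by the provider.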
The team is also working on ways to watermark images generated by DALL-E, not within the image itself but at a wider conceptual level: in the “CLIP representation” produced before the AI generates the image, though, as Aaronson put it, “we don’t know if that’s going to work yet.”