OpenAI, the non-profit research organisation backed by Elon Musk, says it has created an AI capable of generating coherent text without task-specific training. OpenAI has declined to release the full model due to concerns over potential “malicious applications of the technology.”
The AI model is essentially a powerful text generator that can produce substantial, human-like articles from a short sample text or headline.
We've trained an unsupervised language model that can generate coherent paragraphs and perform rudimentary reading comprehension, machine translation, question answering, and summarization — all without task-specific training: https://t.co/sY30aQM7hU pic.twitter.com/360bGgoea3
— OpenAI (@OpenAI) February 14, 2019
The model was fed this human-written prompt: “A train carriage containing controlled nuclear materials was stolen in Cincinnati today. Its whereabouts are unknown.” The machine produced this news story on its first try:
“The incident occurred on the downtown train line, which runs from Covington and Ashland stations.”
“In an email to Ohio news outlets, the U.S. Department of Energy said it is working with the Federal Railroad Administration to find the thief. “The theft of this nuclear material will have significant negative consequences on public and environmental health, our workforce and the economy of our nation,” said Tom Hicks, the U.S. Energy Secretary, in a statement.
“Our top priority is to secure the theft and ensure it doesn’t happen again.””
The system is not perfect yet. As the researchers note, its results include various failures, such as repetitive text and world-breaking contradictions, e.g. writing about fires occurring under water. (Technically, magnesium fires can burn underwater…)
Yet OpenAI have deemed the project so successful that they are breaking with their publishing norms in order to stop it being used by threat actors. The fear is that it will be used to mass-produce malicious content, such as misleading headlines and automated fake or abusive content on social media.
The researchers believe that their own findings, “combined with earlier results on synthetic imagery, audio, and video, imply that technologies are reducing the cost of generating fake content and waging disinformation campaigns.”
OpenAI Holds Back Full AI Model They Created
The language model uses a dataset of 8 million web pages and is trained with one objective: predict the next word. It does this using all the preceding words in the given sample. Remarkably, it is able to pick up style and subject from short samples of text.
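To make the "predict the next word" objective concrete, here is a minimal sketch in Python. It is not OpenAI's method: GPT-2 is a Transformer that conditions on all preceding words, whereas this toy swaps in a simple bigram count model that looks only at the single previous word. The corpus and the function names are made up for illustration. The loss it computes (the average negative log-probability of the word that actually came next) is the same kind of quantity a next-word model is trained to reduce.

```python
import math
from collections import Counter, defaultdict


def train_bigram(tokens):
    """Count how often each word follows each previous word."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def next_word_probs(counts, prev):
    """Turn the raw follow-counts for `prev` into a probability distribution."""
    followers = counts.get(prev, {})
    total = sum(followers.values())
    return {word: c / total for word, c in followers.items()}


def average_loss(counts, tokens):
    """Average negative log-probability assigned to each actual next word."""
    losses = []
    for prev, nxt in zip(tokens, tokens[1:]):
        p = next_word_probs(counts, prev).get(nxt, 0.0)
        losses.append(-math.log(p) if p > 0 else float("inf"))
    return sum(losses) / len(losses)


# Hypothetical toy corpus; WebText itself holds millions of web pages.
corpus = "the train left the station and the train was late".split()
model = train_bigram(corpus)

# After "the", the corpus continues with "train" twice and "station" once.
probs = next_word_probs(model, "the")
print(probs)
print(average_loss(model, corpus))
```

A lower average loss means the model assigns higher probability to the words that really follow, which is what training on a larger dataset with a larger model is meant to achieve.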
The model, called GPT-2, is trained to predict the next word in a given text using a dataset of millions of webpages called WebText. “Our largest model, GPT-2, is a 1.5B parameter Transformer that achieves state of the art results on 7 out of 8 tested language modeling datasets in a zero-shot setting but still underfits WebText,” OpenAI’s research paper states.
“GPT-2 is a direct scale-up of GPT, with more than 10X the parameters and trained on more than 10X the amount of data,” OpenAI note.
The model learns in an unsupervised manner from raw text, using no task-specific training data, to answer questions, translate, or summarise text.
“GPT-2 generates synthetic text samples in response to the model being primed with an arbitrary input. The model is chameleon-like — it adapts to the style and content of the conditioning text,” OpenAI have found.
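The priming behaviour described in that quote can be sketched as a simple generation loop: start from the conditioning text, repeatedly sample a next word, and append it. The sketch below is an assumption-laden illustration, not GPT-2's code. `TOY_MODEL` is a hand-written lookup table standing in for a trained model, and `generate` is a hypothetical helper; a real model would compute the next-word distribution from the whole prompt rather than from the last word alone.

```python
import random

# Hypothetical stand-in for a trained model: maps the last word seen
# to a probability distribution over possible next words.
TOY_MODEL = {
    "officials": {"said": 1.0},
    "said": {"the": 1.0},
    "the": {"theft": 0.5, "train": 0.5},
    "theft": {"was": 1.0},
    "train": {"was": 1.0},
    "was": {"reported": 0.6, "stolen": 0.4},
}


def generate(prompt, model, max_new_words=6, seed=0):
    """Extend `prompt` one sampled word at a time until the model has no continuation."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same text
    words = prompt.split()
    for _ in range(max_new_words):
        if not words:
            break
        dist = model.get(words[-1])
        if not dist:
            break  # the toy table has no entry for this word
        candidates, weights = zip(*dist.items())
        words.append(rng.choices(candidates, weights=weights, k=1)[0])
    return " ".join(words)


print(generate("officials said", TOY_MODEL))
print(generate("the", TOY_MODEL))
```

Because each new word is sampled given what came before, changing the prompt changes where generation starts and therefore what follows, which is a very reduced version of the chameleon-like behaviour OpenAI describe.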
They have released a much smaller version of GPT-2 for testing on GitHub.
Even though they are holding back the full code for the AI, they are quick to point out that others could simply reproduce the results and make them fully open source. However, they believe that their release strategy limits how many organisations would be capable of achieving that at this stage.
Yet this only slows down the inevitable production of a similar AI model, perhaps by threat actors themselves. OpenAI hopes that in the meantime they have bought the AI community more time to discuss the wider implications of this type of text- or image-faking technology.