Rumours have surrounded the size, performance and abilities of GPT-4, the next-generation large language model from OpenAI, since the company launched GPT-3 in June 2020. Speculation has only intensified since the unexpected success of ChatGPT, and the latest rumour, from Microsoft Germany, suggests the tool will be able to analyse and produce more than just text. This could allow users to turn an organisational chart into a text report, or create a mood board from a video.
Microsoft is a major partner for OpenAI, having invested billions in the start-up since 2019 and used its models in a range of products. Speaking at an event in Germany, Andreas Braun, CTO of Microsoft Germany, said GPT-4 was coming next week and “will have multimodal models that will offer completely different possibilities – for example, videos”.
It is also rumoured that the model will be of a similar size to, or smaller than, the 175-billion-parameter GPT-3, thanks to improved optimisation and efficiency efforts. If true, this would see OpenAI follow a trend set by Meta with its LLaMA model and AI21 Labs with Jurassic-2. A long-standing rumour that it will have more than 100 trillion parameters has been debunked by OpenAI co-founder Sam Altman.
If, as Braun suggests, the next generation of OpenAI’s flagship large language model is multimodal, it could prove to be a revolutionary technology: one able to take inputs from a range of different forms of media and to analyse and generate video, images and possibly audio as well.
Multimodal models are nothing new. OpenAI’s own DALL-E is a form of multimodal AI, trained on both text and images to allow for text-to-image or image-to-image generation. CLIP is another OpenAI model, developed to associate visual concepts with language. It is trained contrastively, maximising the agreement between matching image-text pairs while pushing mismatched pairs apart.
It can be used for image classification, object detection and image retrieval. CLIP can also be used for zero-shot learning: performing a task without any task-specific training examples. Microsoft itself has already been experimenting with multimodal AI models, and earlier this month released details of Kosmos-1, a model which can draw on data from text and images.
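To illustrate what zero-shot use of CLIP looks like in practice, the sketch below scores an image against a handful of candidate captions using the openly released CLIP weights via the Hugging Face transformers library. The image file and labels are placeholders chosen for illustration, not drawn from any particular deployment.

```python
# A minimal sketch of zero-shot image classification with the openly released
# CLIP weights (openai/clip-vit-base-patch32) via Hugging Face transformers.
# The image path and candidate labels below are hypothetical placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")  # placeholder input image
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax turns them into
# probabilities over the candidate labels -- no task-specific training needed.
probs = outputs.logits_per_image.softmax(dim=1)
for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.2%}")
```

Because the labels are just natural-language captions, the same model can be pointed at an entirely new classification task simply by changing the strings.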
Multimodal AI: multimedia input and output
Very little specific information has been revealed about GPT-4 beyond the fact that it is likely to outperform the hugely successful GPT-3 and its interim successor GPT-3.5, a fine-tuned version of the original model. The comments from Microsoft Germany suggest multimodality, which could mean anything from accepting image or video inputs to being able to produce a movie.
James Poulter, CEO of voice AI company Vixen Labs, says the former is most likely. “If GPT-4 becomes multi-modal in this way it opens up a whole load of new use cases. For example being able to summarise long-form audio and video like podcasts and documentaries, or being able to extract meaning and patterns from large databases of photos and provide answers about what they contain.”
Many of the big LLM providers are looking at ways to integrate their models with other tools such as knowledge graphs, generative AI models and multimodal outputs, but Poulter says “the speed in which OpenAI has scaled the adoption of ChatGPT and GPT3.5 puts it way out in front in terms of enterprise and consumer trust.”
One of the most likely use cases for multimedia input is speech recognition or automatic transcription of audio and video, predicts AI developer Michal Stanislawek. This would build on the recently released Whisper API, which can quickly transcribe speech into text, as well as on synthetic voice generation. “I hope that this also means being able to send images and possibly videos and continue conversation based on their contents,” he says.
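For context, transcription through the Whisper API is already a single call. The sketch below shows roughly how it looks with the openai Python package as released in early March 2023 (the file name and API key handling are placeholders, and the exact call shape depends on the SDK version in use).

```python
# A minimal sketch of transcribing audio with OpenAI's Whisper API, using the
# openai Python package (v0.27-era interface). File name and key are placeholders.
import openai

openai.api_key = "sk-..."  # placeholder; normally read from an environment variable

with open("podcast_episode.mp3", "rb") as audio_file:  # placeholder audio file
    transcript = openai.Audio.transcribe("whisper-1", audio_file)

# The returned object includes the transcribed text, which could then be passed
# to a GPT model for summarisation or question answering.
print(transcript["text"])
```

A pipeline of this kind, transcription followed by summarisation, is the sort of long-form audio and video use case Poulter and Stanislawek describe.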
“Multi-modality will be a huge change in how people utilise AI and what new use cases it can support. Entire companies will be built based on it,” adds Stanislawek, citing examples such as synthetic sports commentators in multiple languages, real-time summaries of meetings and events, and analysis of graphs to extract more meaning.
Will GPT-4 be truly multimodal?
Conversational AI expert Kane Simms agrees, adding that multimodal input, rather than output, is the more likely, but that if it is output-based then “you’re in interesting territory”, suggesting it could be used to generate a video from an image and audio file, or create a “mood board” from a video.
However, Mark L’Estrange, a senior lecturer in e-sports at Falmouth University’s Games Academy, told Tech Monitor it is unlikely to be multimodal in the true sense of the word, as that requires much more development and compute power. “Multi-modal means that you can give it verbal cues, you can upload pictures, you can give it any input whatsoever and it understands it and in context produces anything you want,” he says, adding “right now we have a very fractured framework.”
He said that will come, describing it as ‘universal-modal’, where you could, through a series of inputs and prompts, generate something like a game prototype that can then be worked up into a full game using human input and talent. “The human input is what’s required to make these unique games that have these unique visions and to choose the right outputs from the AI. So maybe a team that was 40 or 50 people before would now be 20 people.”
Even if it is only partially multimodal, able to take a simple image input and generate a text report, this could be significant for enterprises. It would allow a manager to submit a graph of performance metrics across different software options and have the AI generate a full report, or a CEO to send an organisation chart and have the AI suggest optimisations and changes for best performance.