OpenAI has launched its most advanced artificial intelligence model to date, GPT-4o, marking a significant step towards more natural interaction between humans and computers. The "o" in the name stands for "omni," reflecting the model's ability to accept any combination of text, audio, and images as input and to generate output in all three formats. GPT-4o can not only process information but also recognize emotions, handle being interrupted mid-response, and reply as quickly as a human does in conversation.
Mira Murati, Chief Technology Officer at OpenAI, said that while the model's intelligence remains on a par with GPT-4, the new algorithm offers enhanced capabilities across modalities and environments. "In the last couple of years, we have focused on enhancing the intelligence of our models. This is the first time we are making a significant step forward in terms of ease of use," Murati noted during the presentation.
Live Demonstrations Highlight GPT-4o’s Capabilities
During the unveiling, OpenAI showcased GPT-4o's real-time capabilities: translating live between English and Italian, helping a researcher solve a linear equation written on paper, and offering deep-breathing tips to one of the company's research leads. The demonstrations highlighted GPT-4o's versatility and immediate responsiveness, both essential for real-world applications.
What Sets GPT-4o Apart from Its Predecessors
Its predecessor, GPT-4 Turbo, could already analyze images and text for tasks such as extracting written content from pictures or describing what they show. GPT-4o adds native speech processing, folding all three data formats (text, audio, and images) into a single neural network's processing stream. That is a departure from earlier models like GPT-3.5 and GPT-4, which handled voice queries only by first transcribing the sound into text, a step that stripped the interaction of intonation and emotion and slowed down responses.
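To make that difference concrete, here is a minimal sketch of the older cascaded voice pipeline using the openai Python SDK. The file names and model choices are illustrative assumptions, not OpenAI's published reference code:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Step 1: speech -> text with a transcription model.
# Intonation and emotion are lost at this stage.
with open("question.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

# Step 2: text -> text with the chat model.
reply = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": transcript.text}],
)

# Step 3: text -> speech with a separate text-to-speech model.
speech = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input=reply.choices[0].message.content,
)
with open("answer.mp3", "wb") as out:
    out.write(speech.content)
```

With GPT-4o, these three stages collapse into a single model, which is what removes the transcription bottleneck described above.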
With GPT-4o, using ChatGPT feels more like conversing with a human assistant. Users can now interrupt the chatbot mid-response, and according to OpenAI, the algorithm reacts in "real time": it can pick up the nuances of a user's voice and generate responses in a range of emotional styles, including singing.
Enhanced "Vision," Language, and Speech Capabilities
GPT-4o also extends ChatGPT's vision capabilities. Given a photograph or a desktop screenshot, the chatbot can quickly answer related questions, from "what is happening in this code?" to "what brand of shirt is this person wearing?" Murati said these capabilities will continue to evolve: today GPT-4o can scan a foreign-language menu and translate it, while future versions should let ChatGPT "watch" a live sports game and explain the rules.
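For developers, such an image question can be posed through the standard chat completions endpoint. A minimal sketch, assuming a publicly reachable image URL (the URL below is a placeholder):

```python
from openai import OpenAI

client = OpenAI()

# Ask GPT-4o a question about an image, e.g. a desktop screenshot.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is happening in this code?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/screenshot.png"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```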
OpenAI also stated that the new algorithm is more multilingual, understanding roughly 50 languages, and runs twice as fast as GPT-4 Turbo through the OpenAI API and Microsoft's Azure OpenAI Service. The model is cheaper to run as well and comes with less restrictive rate limits than its predecessors.
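The same chat call works through Azure OpenAI Service via the SDK's AzureOpenAI client. A minimal sketch, assuming you have created a GPT-4o deployment in your own Azure resource (the endpoint, key, API version, and deployment name below are placeholders):

```python
from openai import AzureOpenAI

# Placeholder credentials: substitute the values from your own
# Azure OpenAI resource before running.
client = AzureOpenAI(
    azure_endpoint="https://YOUR-RESOURCE.openai.azure.com",
    api_key="YOUR-AZURE-KEY",
    api_version="2024-02-01",
)

response = client.chat.completions.create(
    model="gpt-4o",  # the name you gave your GPT-4o deployment
    messages=[
        {"role": "user", "content": "Summarize GPT-4o in one sentence."}
    ],
)
print(response.choices[0].message.content)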
Selective Voice Feature Rollout and Broad Accessibility
Initially, voice support in the GPT-4o API will not be available to all customers. Citing the risk of misuse, the company plans to first launch this feature for a "small group of trusted partners" in the coming weeks.
However, OpenAI will roll the new model out to all users, including those on ChatGPT's free tier, over the next few weeks. Subscribers to the paid Plus and Team plans will get significantly higher usage limits.
New User Interface and Applications for ChatGPT
OpenAI also announced an updated web interface for ChatGPT, with a more conversational home screen and message layout. A desktop version of the chatbot for macOS is now available to paid users, with a Windows version expected later this year. Free users will additionally gain access to the GPT Store, a library of third-party AI chatbots and tools for building them, as well as some previously paid features such as the "memory" function.
Earlier, it had been reported that OpenAI would introduce an AI-based search engine on May 13, further expanding its portfolio of technologies.