OpenAI upends the industry: GPT-4o is completely free, its real-time voice and video interaction stuns the audience, and we have stepped straight into the science-fiction era

04:20, May 14, 2024 · Market News

Special topic: OpenAI releases its latest flagship model GPT-4o, with voice features free for everyone

Source: Machine Heart

Author: Machine Heart Editorial Department

Just 17 months after ChatGPT's debut, OpenAI has delivered the kind of super AI seen in science-fiction movies, and it is completely free and available to everyone.

It's shocking!

While other technology companies are still chasing large-model multimodality, cramming text summarization, photo editing, and similar features into their phones, the far-ahead OpenAI has gone straight for a knockout blow. Even its CEO, Sam Altman, marveled at the launch: just like in a movie.

In the early hours of May 14 (Beijing time), at its first "Spring Update" event, OpenAI unveiled its new flagship model GPT-4o along with a desktop app, and demonstrated a series of new capabilities. This time, the technology reshapes the product itself, and OpenAI has, by its actions, taught technology companies around the world a lesson.

The presentation was hosted by Mira Murati, OpenAI's chief technology officer. She said she would cover three main things:

  • First, OpenAI will bring its products to free users first, so that more people can use them.

  • Second, OpenAI has released a desktop app and an updated UI that is simpler and more natural to use.

  • Third, GPT-4 now has a successor, called GPT-4o. What makes GPT-4o special is that it brings GPT-4-level intelligence to everyone, including free users, through a much more natural mode of interaction.

With this ChatGPT update, the model can accept any combination of text, audio, and images as input, and generate any combination of text, audio, and images as output in real time. This is the interaction paradigm of the future.

ChatGPT recently became usable without registration, and today a desktop app has been added. OpenAI's goal is to let people use it anywhere, anytime, without friction, so that ChatGPT blends into your workflow. This AI is now a productivity tool.

GPT-4o is a new large model built for the future of human-computer interaction. It understands text, voice, and images; it responds quickly, conveys emotion, and feels human.

On stage, OpenAI engineers took out an iPhone to demonstrate the new model's main capabilities, the most important being real-time voice conversation. Mark Chen said, "I'm a little nervous; this is my first live launch event." ChatGPT replied: why not take a deep breath?

OK, I'll take a deep breath.

ChatGPT immediately shot back: not like that, you're breathing way too hard.

Anyone who has used a voice assistant like Siri will notice the obvious differences. First, you can interrupt the AI at any time and move on without waiting for it to finish. Second, there is no lag: the model responds extremely quickly, faster than a human would. Third, the model fully picks up on human emotion, and it can express a range of emotions itself.

Next came vision. Another engineer wrote an equation on paper and asked ChatGPT not to give the answer directly but to explain, step by step, how to solve it. It clearly has great potential for tutoring people through problems.

Next up were GPT-4o's coding capabilities. Given some code, the presenters opened the ChatGPT desktop app on a computer, interacted with it by voice, and asked it to explain what the code does and what a particular function is for. ChatGPT answered every question.

Running the code produces a temperature curve chart, and ChatGPT can field any question about the chart in a single sentence.

Which months are the hottest? Is the y-axis in Celsius or Fahrenheit? It answers them all.
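The demo script itself was not published; as a purely hypothetical reconstruction of the kind of code shown (the synthetic data, the `smooth` helper, and the 30-day window are all invented for illustration), it might have looked roughly like this:

```python
# Hypothetical stand-in for the demoed script: plot daily temperatures
# together with a smoothed rolling average.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
days = np.arange(365)
# Synthetic daily temperatures (°C): a seasonal sine wave plus noise.
temps = 15 + 10 * np.sin(2 * np.pi * (days - 80) / 365) + rng.normal(0, 2, 365)

def smooth(x, window=30):
    """Rolling average: the kind of function the presenters asked ChatGPT to explain."""
    return np.convolve(x, np.ones(window) / window, mode="same")

plt.plot(days, temps, alpha=0.4, label="daily")
plt.plot(days, smooth(temps), label="30-day average")
plt.xlabel("Day of year")
plt.ylabel("Temperature (°C)")
plt.title("Annual temperature curve")
plt.legend()
plt.show()
```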

OpenAI also took live requests from X/Twitter users. One example was real-time voice translation: the phone can act as an interpreter, translating back and forth between Spanish and English.

Another user asked: can ChatGPT recognize your facial expressions?

It seems GPT-4o can already do real-time video understanding.

Now let's take a closer look at the bombshell OpenAI dropped today.

   The omni model: GPT-4o

The first is GPT-4o itself; the "o" stands for "omni", an all-modality model.

For the first time, OpenAI has integrated all modalities into a single model, greatly improving the practicality of large models.

OpenAI CTO Mira Murati said that GPT-4o offers "GPT-4-level" intelligence while improving on GPT-4's text, vision, and audio capabilities, and that it will roll out "iteratively" across the company's products over the next few weeks.

"GPT-4o reasons across voice, text, and vision," Murati said. "We know these models are getting more and more complex, but we want the interaction experience to become more natural and simple, so that you don't have to think about the interface at all and can focus entirely on collaborating with GPT."

GPT-4o matches GPT-4 Turbo on English text and code, with significantly better performance on non-English text. The API is also faster and 50% cheaper. Compared with existing models, GPT-4o is especially strong at visual and audio understanding.
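For developers, trying the new model is mostly a one-line change. A minimal sketch using the official OpenAI Python SDK (assuming the `gpt-4o` model identifier, an `OPENAI_API_KEY` in the environment, and a placeholder image URL):

```python
# Minimal sketch: text + image input to GPT-4o via the OpenAI Python SDK.
# Requires `pip install openai` and OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",  # previously e.g. "gpt-4-turbo"; same endpoint, lower price
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What trend does this chart show?"},
            # Images can be sent alongside text in the same message.
            {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
        ],
    }],
)
print(response.choices[0].message.content)
```

Note that, per the article, GPT-4o's new audio and video capabilities reach the API only for a small group of trusted partners at first; text and image inputs are the generally available modalities.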

It can respond to audio input in as little as 232 milliseconds, with an average of 320 milliseconds, similar to human conversational response times. Before GPT-4o, anyone who tried ChatGPT's voice conversation feature could feel the latency: 2.8 seconds on average with GPT-3.5 and 5.4 seconds with GPT-4.

That older voice mode was a pipeline of three separate models: a simple model transcribes audio to text, GPT-3.5 or GPT-4 takes text in and puts text out, and a third simple model converts the text back to audio. OpenAI found that this design loses a great deal of information: the main model cannot directly observe tone, multiple speakers, or background noise, and it cannot output laughter, singing, or emotional expression.
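As a rough illustration of that three-model pipeline (a sketch, not OpenAI's internal implementation; `whisper-1` and `tts-1` are the public API stand-ins for the "simple models" described):

```python
# Sketch of the legacy voice pipeline: speech -> text -> LLM -> text -> speech.
# Each hop discards information (tone, multiple speakers, background sound)
# that an end-to-end model like GPT-4o can retain.
from openai import OpenAI

client = OpenAI()

# 1) Transcribe the user's audio into plain text.
with open("user_question.mp3", "rb") as f:
    transcript = client.audio.transcriptions.create(model="whisper-1", file=f)

# 2) A text-only model answers; it never "hears" the original audio.
answer = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": transcript.text}],
)

# 3) Convert the text answer back into speech.
speech = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input=answer.choices[0].message.content,
)
with open("reply.mp3", "wb") as out:
    out.write(speech.content)
```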

With GPT-4o, OpenAI trained a single new model end-to-end across text, vision, and audio, meaning all inputs and outputs are processed by the same neural network.

"Technically, OpenAI has found a way to map audio directly to audio as a first-class modality and to stream video to the transformer in real time. These require some new research on tokenization and architecture, but overall it's a data and systems optimization problem (as most things are)," commented Jim Fan, a scientist at Nvidia.

GPT-4o can reason in real time across text, audio, and video, an important step toward more natural human-computer interaction (and even human-machine-machine interaction).

Greg Brockman, OpenAI's president, also had two GPT-4os talk to each other online and improvise a song. The melody was, shall we say, hard to describe, but the lyrics covered the room's decor, what the people were wearing, and little incidents along the way.

GPT-4o's image understanding and generation are also far better than any existing model's, and many previously impossible tasks have become "easy".

For example, you can ask it to put the OpenAI logo on a coaster:

With this round of research, OpenAI appears to have largely solved the long-standing problem of ChatGPT rendering legible text in generated images.

GPT-4o can also generate 3D visual content, performing 3D reconstruction from six generated images:

Given a poem, GPT-4o can typeset it in a handwritten style:

It can handle more complex typography, too:

Working with GPT-4o, a few paragraphs of text are enough to produce a coherent sequence of comic panels:

The following tricks should give many designers pause:

There are niche features, too, such as converting text to artistic lettering:

   GPT-4o performance evaluation results

Members of OpenAI's technical team said on X that "im-also-a-good-gpt2-chatbot", the mystery model that had stirred wide debate on the LMSYS Chatbot Arena, is a version of GPT-4o.

On hard prompt sets, especially coding, GPT-4o shows a particularly large performance improvement over OpenAI's previous best model.

Specifically, across many benchmarks, GPT-4o matches GPT-4 Turbo on text, reasoning, and coding, while setting new highs in multilingual, audio, and visual capabilities.

Improved reasoning: GPT-4o sets a new high of 87.2% on 5-shot MMLU (general-knowledge questions). (Note: Llama 3 400B was still in training at the time.)

The M3Exam benchmark tests both multilingual and visual ability: it consists of multiple-choice questions from standardized exams in many countries and regions, including figures and charts. GPT-4o beats GPT-4 on every language in the benchmark.

Going forward, improved model capability will enable more natural, real-time voice conversation, as well as communicating with ChatGPT over live video. For example, a user could show ChatGPT a live sports game and ask it to explain the rules.

   ChatGPT users get more advanced features for free

More than 100 million people use ChatGPT every week. OpenAI said GPT-4o's text and image capabilities begin rolling out in ChatGPT for free today, with Plus users getting up to five times the message limit.

Open ChatGPT now, and GPT-4o is already available.

With GPT-4o, free ChatGPT users now get GPT-4-level intelligence, and responses drawn from both the model and the web.

Free users also get the following options:

Analyze data and create charts:

Chat about photos they take:

Upload files for help with summarizing, writing, or analysis:

Discover and use GPTs and the GPT Store:

And use Memory to create a more helpful experience.

However, the number of GPT-4o messages free users can send will be capped based on usage and demand. When the limit is reached, ChatGPT automatically switches to GPT-3.5 so the conversation can continue.

OpenAI will also launch a new GPT-4o-powered Voice Mode alpha in ChatGPT Plus in the coming weeks, and will bring more of GPT-4o's new audio and video capabilities to a small group of trusted partners via the API.

Of course, extensive testing and iteration show that GPT-4o still has limitations across all modalities, and OpenAI says it is working hard to improve on these imperfections.

Opening up GPT-4o's audio modality will inevitably bring new risks. On safety, GPT-4o builds safeguards into its cross-modal design, through training-data filtering and post-training refinement of model behavior, and OpenAI has also created a new safety system to guard voice outputs.

   New desktop app simplifies user workflow

For both free and paid users, OpenAI also launched a new ChatGPT desktop app for macOS. With a simple keyboard shortcut (Option + Space), users can instantly put a question to ChatGPT, and they can take screenshots directly within the app and discuss them.

Users can also have voice conversations with ChatGPT right from their computer, with GPT-4o's audio and video capabilities coming in the future: click the headphone icon in the bottom-right corner of the desktop app to start a voice conversation.

Starting today, OpenAI is rolling out the macOS app to Plus users, and will make it more broadly available in the coming weeks. A Windows version will follow later this year.

   Altman: you go open source, we go free

After the launch, OpenAI CEO Sam Altman published a long-absent blog post reflecting on the journey behind GPT-4o:

There are two things I want to highlight from our announcement today.

First, a key part of our mission is to put very capable AI tools in people's hands for free (or at a great price). I am very proud that we are making the world's best model available for free in ChatGPT, with no ads or anything like that.

When we founded OpenAI, our original conception was that we would create AI and use it to create all sorts of benefits for the world. The situation has changed: it now looks like we will create AI, and then other people will use it to create all sorts of amazing things, and all of us will benefit.

Of course, as a business we will find plenty of things to charge for, and that will help us provide free, outstanding AI service to (we hope) billions of people.

Second, the new voice and video mode is the best computing interface I have ever used. It feels like the AI from the movies, and I'm still a little surprised it's real. Getting to human-level response times and expressiveness turns out to be a huge leap.

The original ChatGPT hinted at what was possible with a language interface; this new thing (the GPT-4o version) feels fundamentally different. It is fast, smart, fun, natural, and helpful.

For me, interacting with a computer has never felt truly natural, until now. And as we add (optional) personalization, access to personal information, the ability for AI to take actions on your behalf, and more, I can really see an exciting future where we use computers to do far more than ever before.

Finally, huge thanks to the team for working so hard to make this happen!

Notably, Altman said in an interview last week that while universal basic income is hard to achieve, we might achieve "universal basic compute" for free instead: in the future, everyone would get free access to GPT compute, which they could use, resell, or donate.

"The idea is that as AI becomes more advanced and embedded in all aspects of our lives, having a big language model unit like GPT-7 may be more valuable than money, and you have some productivity," Altman explained.

The release of GPT-4o may be the beginning of OpenAI's efforts in this direction.

Yes, this is just the beginning.

Finally, the "guess the May 13 announcement" video on OpenAI's blog lines up almost exactly with the teaser for Google's I/O conference tomorrow, an unmistakable shot across Google's bow. One wonders how much pressure Google is feeling after today's OpenAI launch.


Editor in charge: Wei Yihan
