Asking an Nvidia researcher to interpret Sora's real technological breakthrough

OpenAI's technical report on Sora covers a lot of ground, but as I recall, the idea of borrowing LLM concepts to generate videos did not originate with OpenAI. So I asked a researcher friend at Nvidia about it and wrote up this research note along that thread.

Previous video generation methods

  • Recurrent networks (RNNs)
  • Generative adversarial networks (GANs)
  • Autoregressive transformers
  • Diffusion models

Previous methods could only generate videos of specific visual categories, or shorter clips at fixed resolution. The RNN and GAN approaches can largely be set aside because of their poor results.

LLM for Videos?

Shortly after Sora was announced, OpenAI published a technical article explaining that, inspired by large language models, Sora represents videos as visual patches, much as an LLM represents text as tokens.

Because of limited space, the article is greatly simplified and omits much of the academic context. The idea of using a vision transformer and a tokenizer to process videos is not new: ViViT, proposed by Google in 2021, is a video classification model built on the vision transformer.
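To make the idea of visual patches concrete, here is a minimal sketch of cutting a video clip into spacetime patch tokens, the visual analogue of an LLM tokenizer turning text into tokens (the shapes and patch sizes below are assumptions for illustration, not Sora's actual configuration):

    import torch

    # A toy video clip: (batch, frames, channels, height, width)
    video = torch.randn(1, 16, 3, 256, 256)

    # Assumed patch size: 2 frames x 16 x 16 pixels per spacetime patch
    pt, ph, pw = 2, 16, 16
    b, t, c, h, w = video.shape

    # Cut the clip into non-overlapping spacetime patches and flatten each
    # patch into a vector -- one "visual token" per patch.
    patches = (
        video
        .reshape(b, t // pt, pt, c, h // ph, ph, w // pw, pw)
        .permute(0, 1, 4, 6, 2, 3, 5, 7)   # group the patch-grid dims together
        .reshape(b, (t // pt) * (h // ph) * (w // pw), pt * c * ph * pw)
    )

    print(patches.shape)  # torch.Size([1, 2048, 1536]): 2048 tokens of dim 1536

The transformer then operates on this token sequence exactly as an LLM operates on word tokens.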

So I asked an Nvidia AI researcher what is actually novel about Sora. He said that in academia, researchers have long debated whether ViT or UNet is the better architecture for text-to-video generation. In recent years ViT has become the mainstream visual backbone, yet UNet still dominates diffusion models: from DALL·E 2 to Stable Diffusion, UNet is used throughout text-to-image models.

In 2023, Google proposed MAGVIT, which uses a shared token vocabulary to produce concise and expressive codes for both videos and images. The MAGVIT v2 paper is titled "Language Model Beats Diffusion: Tokenizer is Key to Visual Generation", a title that reflects this ongoing "war". The model was later integrated into Google's VideoPoet, a large model for zero-shot video generation.

Seen in this light, Sora's validation of the DiT technical path is itself a major innovation. Now let's talk about what DiT is.

Network structure

The paper Scalable Diffusion Models with Transformers proposes a neural network architecture called DiT, which combines the advantages of vision transformers and diffusion models.

DiT = VAE encoder + ViT + DDPM + VAE decoder

The ICCV 2023 paper proposes a diffusion model with a transformer backbone. ( https://arxiv.org/pdf/2212.09748.pdf )
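Reading the formula above as a pipeline, here is a highly simplified sketch of a DiT-style denoising step: a transformer runs over latent spacetime patches and predicts the noise to remove, conditioned on the diffusion timestep (module names and dimensions are made up, and the real DiT injects conditioning through adaLN blocks rather than simple addition):

    import torch
    import torch.nn as nn

    class TinyDiT(nn.Module):
        """Toy DiT-style denoiser, not the actual DiT/Sora code."""
        def __init__(self, patch_dim=1536, d_model=512, n_layers=4, n_heads=8):
            super().__init__()
            self.patch_embed = nn.Linear(patch_dim, d_model)
            self.time_embed = nn.Sequential(
                nn.Linear(1, d_model), nn.SiLU(), nn.Linear(d_model, d_model))
            layer = nn.TransformerEncoderLayer(
                d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
            self.blocks = nn.TransformerEncoder(layer, n_layers)
            self.out = nn.Linear(d_model, patch_dim)  # predict noise per patch

        def forward(self, noisy_patches, t):
            x = self.patch_embed(noisy_patches)
            # Inject the timestep into every token (real DiT uses adaLN instead).
            x = x + self.time_embed(t.view(-1, 1, 1).float())
            return self.out(self.blocks(x))

    model = TinyDiT()
    noisy = torch.randn(1, 256, 1536)   # latent spacetime patch tokens
    t = torch.tensor([500])             # diffusion timestep
    pred_noise = model(noisy, t)        # same shape as the input

During sampling, DDPM repeats this denoising step over many timesteps, and the VAE decoder finally maps the denoised latents back to pixels.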

As for how DiT is applied in Sora, Saining Xie, one of the DiT authors, noted in a tweet:

  • Based on a back-of-the-envelope calculation from the batch size, Sora may have about 3 billion parameters. "Training the Sora model may not require as many GPUs as people expect; I expect very fast iterations in the future."
  • Sora "may also use Google's Patch n' Pack (NaViT) results to let it handle variable resolutions, durations, and aspect ratios" (a rough sketch of the packing idea follows below).

Video compression

According to Sora's official blog, training videos are first compressed into a compact spacetime latent code; a decoder model then maps the generated codes back into pixel space.

Saining commented: "It looks like a VAE architecture, trained on raw video data."
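A minimal sketch of that compressor idea: a convolutional autoencoder whose 3D convolutions downsample a video in both space and time, with transposed convolutions mapping the latent back to pixels (strides and channel counts are assumptions; Sora's actual compressor, and whether it is exactly a VAE, is not public):

    import torch
    import torch.nn as nn

    class VideoAutoencoder(nn.Module):
        """Toy spacetime compressor; a real VAE would also predict a
        mean/variance and add a KL term, omitted here for brevity."""
        def __init__(self, latent_channels=4):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Conv3d(3, 64, 3, stride=2, padding=1),               # /2 in t, h, w
                nn.SiLU(),
                nn.Conv3d(64, latent_channels, 3, stride=2, padding=1), # /4 in t, h, w
            )
            self.decoder = nn.Sequential(
                nn.ConvTranspose3d(latent_channels, 64, 4, stride=2, padding=1),
                nn.SiLU(),
                nn.ConvTranspose3d(64, 3, 4, stride=2, padding=1),
            )

        def forward(self, video):
            z = self.encoder(video)          # compact spacetime latent
            return self.decoder(z), z

    video = torch.randn(1, 3, 16, 128, 128)  # (batch, channels, frames, H, W)
    recon, z = VideoAutoencoder()(video)
    print(z.shape)   # torch.Size([1, 4, 4, 32, 32]) -- ~48x fewer values than pixels

The diffusion transformer then works entirely in this compact latent space, which is what makes training on long, high-resolution videos affordable.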

Text understanding

  • Sora's official blog reveals that the team trained a model to generate descriptive captions for videos, a technique first proposed in the DALL·E 3 paper. This greatly improves Sora's ability to understand the user's input text, as well as the overall video quality.
  • The team also uses GPT to expand users' short text prompts into detailed captions. Prompt rewriting is almost standard practice in AI products today, bridging the gap between user instructions and model behavior (see the sketch below).
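For illustration, prompt rewriting typically looks something like the following (a generic sketch using the OpenAI Python client; the model name and system prompt are placeholders, not what Sora actually uses):

    from openai import OpenAI

    client = OpenAI()

    SYSTEM = (
        "You expand short video ideas into detailed, concrete captions: "
        "describe the subject, setting, lighting, camera motion, and style."
    )

    def rewrite_prompt(user_prompt: str) -> str:
        """Expand a terse user prompt into the detailed caption style
        the video model was trained on."""
        resp = client.chat.completions.create(
            model="gpt-4o-mini",   # placeholder model name
            messages=[{"role": "system", "content": SYSTEM},
                      {"role": "user", "content": user_prompt}],
        )
        return resp.choices[0].message.content

    print(rewrite_prompt("a corgi surfing at sunset"))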

Training data

  • It is speculated that Sora's training data includes synthetic videos rendered with 3D engines; extensive use of synthetic data likely played an important role in Sora's training.
  • Sora trains on videos at their native aspect ratios, which yields better composition and framing.

Takeaways

Sora has shown that the "scale works wonders" rule also applies to video generation: using the DiT model and the idea of token-based encoding, it achieves astonishing results. We can see that Sora moves the camera smoothly, keeps objects' appearance consistent, remembers where objects are, and lets objects in the video interact with each other.

A major advance is that Sora can create long videos; the technical paths for producing 5-second videos and 1-minute videos are very different. Before Sora, researchers wondered whether long-video generation would require specialized, category-specific models, or even a full physics simulator. Sora tells us that end-to-end training of a general model is enough to generate long videos.

Breakthroughs in video generation will also help many other fields, such as 3D generation, autonomous driving, and robotics, and may eventually let us simulate the physical world.

The GAIA-1 model can synthesize visual road scenes to help train models for autonomous driving. ( https://arxiv.org/pdf/2309.17080.pdf )

The next challenge for video generation then becomes how to curb error accumulation and keep video quality and consistency over time.

When Sora officially launches, we look forward to testing it hands-on and drawing more conclusions~
