In today's IT world, if you don't know at least a little about AI, you risk being left behind by the times. But as an ordinary user, I suspect most people, like me, can only afford a gaming graphics card; spending tens of thousands on a dedicated compute card is simply unrealistic. So here's the question: if I use an RTX 4090 gaming card for AI, how does it actually differ from a compute card, and what can it realistically do?
01
Raw compute is comparable
The real gap lies in communication bandwidth and memory specs
Take the H100/A100 that NVIDIA launched a few years ago as an example. Although we call them compute cards, their raw compute advantage over gaming cards like the RTX 4090 is not especially dramatic. Part of the reason is that NVIDIA's spec sheets play some numbers games. The H100's Tensor FP16 compute, for instance, is listed as 1979 TFLOPS, but that is the sparse-compute figure, which is double the dense-compute figure.
So-called sparse compute relies on NVIDIA's structured sparsity feature: if a model has been pruned so that two of every four weights are zero (2:4 fine-grained structured sparsity), the Tensor Cores can skip the zeros and achieve double the nominal throughput. In other words, the sparse figure only applies to models specifically prepared for it.
Dense compute, in contrast, is the throughput when every element actually gets computed, with no sparsity assumption. This is what ordinary models and workloads get, so it is the figure that reflects real-world performance.
For typical AI workloads, dense compute is what matters, so the H100's genuinely usable Tensor FP16 figure is 989 TFLOPS. Coincidentally, NVIDIA's official marketing for the RTX 4090 touts 1321 TFLOPS of Tensor Core compute, but that is the INT8 figure; its FP16 compute is only about 330 TFLOPS. Even so, that already edges out the A100's 312 TFLOPS, so the gap in raw compute is not as large as you might expect.
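To put those dense-compute numbers in perspective, here is a quick back-of-the-envelope sketch: the ideal time for one large FP16 matrix multiply at each card's dense Tensor peak. (The TFLOPS values are the dense Tensor FP16 figures quoted above; real workloads only reach a fraction of peak, so treat these as lower bounds.)

```python
# Theoretical time for one FP16 GEMM at each card's dense Tensor peak.
# Multiplying an (m x k) matrix by a (k x n) matrix costs ~2*m*k*n FLOPs.

def gemm_seconds(m: int, k: int, n: int, tflops: float) -> float:
    """Ideal execution time assuming 100% of peak throughput."""
    flops = 2 * m * k * n
    return flops / (tflops * 1e12)

dense_fp16 = {"H100": 989, "A100": 312, "RTX 4090": 330}  # TFLOPS, from the table

m = k = n = 8192  # a large square matmul, typical of transformer layers
for card, tflops in dense_fp16.items():
    print(f"{card:8s}: {gemm_seconds(m, k, n, tflops) * 1e3:.2f} ms")
```

On paper, the RTX 4090 even finishes this slightly faster than the A100, which is exactly the point: raw compute alone does not separate gaming cards from compute cards.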
Spec comparison: compute cards vs. a gaming card

| Spec | H100 | A100 | RTX 4090 |
| --- | --- | --- | --- |
| Tensor FP16 dense compute | 989 TFLOPS | 312 TFLOPS | 330 TFLOPS |
| Tensor TF32 dense compute | 495 TFLOPS | 156 TFLOPS | 83 TFLOPS |
| Memory capacity | 80GB HBM3 | 80GB HBM2e | 24GB GDDR6X |
| Memory bandwidth | 3.35TB/s | 2TB/s | 1TB/s |
| Interconnect bandwidth | 900GB/s (SXM) | 600GB/s (SXM) | 64GB/s (PCIe 4.0) |
What really opens up the gap is the communication bandwidth and memory specs of compute cards like the H100/A100. NVIDIA's compute cards can forgo the PCIe interface in favor of the dedicated SXM form factor and interconnect multiple cards via NVLink, pushing inter-card bandwidth to an astonishing 900GB/s on the H100. The RTX 4090 can only use PCIe, and NVLink support has been removed from the product line entirely, so its current ceiling is PCIe 4.0 x16's 64GB/s.
On the memory side, the compute cards carry 80GB of HBM with bandwidth up to 3.35TB/s on the H100, while the RTX 4090's 24GB of GDDR6X delivers only about 1TB/s.
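To see why interconnect bandwidth matters so much, consider how long it takes just to move a large model's weights between cards. A minimal sketch, using the bandwidth figures from the table above and assuming a 70B-parameter model stored in FP16 (2 bytes per parameter):

```python
# Time to shuffle a large model's weights over each interconnect.
# Bandwidths are from the spec table; the 70B / FP16 model is an
# illustrative assumption, not a measurement.

def transfer_seconds(gigabytes: float, gb_per_s: float) -> float:
    return gigabytes / gb_per_s

weights_gb = 70e9 * 2 / 1e9  # 70B params in FP16 = 140 GB

links = {"NVLink (H100 SXM)": 900, "PCIe 4.0 x16 (RTX 4090)": 64}
for name, bw in links.items():
    print(f"{name}: {transfer_seconds(weights_gb, bw):.2f} s")
```

The PCIe path is roughly 14x slower, and multi-GPU training repeats this kind of traffic constantly, which is why the interconnect gap dominates the raw-compute gap.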
02
Gaming cards can't train AI
but they can run inference
Jensen Huang's precise "knife work" in product segmentation has always been a hot topic among enthusiasts: with professional GPUs carrying much higher margins, strict feature gating between product lines is essential. Technically speaking, large-model training demands high-performance inter-GPU communication, and gaming cards, even the flagship RTX 4090, have exactly that capability cut, because training AI almost always means running a GPU cluster.
Take Meta AI's open-source LLaMA-2-70B model as an example. On a single A100, one training run would take about 1.7 million GPU-hours; to finish within a month you would need at least 2,400 A100s. Gaming cards are not designed for clustering the way professional compute cards are: even if you had more than 2,000 RTX 4090s, you could not interconnect them effectively. On top of that, gaming cards are not licensed for data center deployment, so they cannot substitute for compute cards at all here.
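The cluster-sizing estimate in that paragraph is simple arithmetic, sketched below with the article's own figures (1.7 million GPU-hours, a 30-day month):

```python
import math

# Cluster sizing: how many A100s to finish ~1.7M GPU-hours in one month?
gpu_hours = 1_700_000
hours_per_month = 30 * 24  # 720 hours

gpus_needed = math.ceil(gpu_hours / hours_per_month)
print(gpus_needed)  # ~2362, hence "at least 2400 A100s" once you add headroom
```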
AI training requires many graphics cards computing in parallel, and gaming cards are "congenitally deficient" in this respect
In addition, AI training needs massive amounts of data held in video memory. A single compute card with 80GB simply outclasses a gaming card on this spec; you would need several gaming cards just to match the capacity. Compute-card memory also supports ECC error correction, which meaningfully reduces failure rates, and a low failure rate is the foundation of sustained compute output.
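To illustrate just how far training memory requirements exceed a gaming card's 24GB, here is a rough footprint estimate using a common rule of thumb for mixed-precision Adam training (about 16 bytes per parameter for weights, gradients, and optimizer state, ignoring activations entirely; the per-component breakdown is an assumption, not from the article):

```python
# Rough training-memory footprint per parameter for mixed-precision Adam:
# FP16 weights (2B) + FP16 grads (2B) + FP32 master weights (4B)
# + two FP32 Adam moment buffers (8B) = 16 bytes/param, before activations.
BYTES_PER_PARAM_TRAINING = 16

def training_vram_gb(params: float) -> float:
    return params * BYTES_PER_PARAM_TRAINING / 1e9

for name, p in {"7B": 7e9, "70B": 70e9}.items():
    gb = training_vram_gb(p)
    print(f"{name}: ~{gb:.0f} GB of state -> {gb / 80:.0f}+ 80GB compute cards")
```

Even a 7B model's training state alone dwarfs 24GB of VRAM, before a single activation is stored.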
So if gaming cards cannot train AI, what can they do? Anyone who has used Stable Diffusion knows that when running local text-to-image applications, the efficiency advantage of a high-performance gaming card is very obvious. In other words, used correctly, gaming cards are good at AI inference.
The reason we stress "used correctly" is that VRAM capacity easily becomes the bottleneck. In current AI inference, whether using pipeline parallelism or tensor parallelism, memory bandwidth can limit efficiency, and on top of that both the model weights and the caches generated during computation must fit in VRAM. That is why many local AI applications ask users to configure VRAM usage in advance to maximize efficiency, and it is also why the RTX 4090, as a large-VRAM gaming card, is particularly well suited to running local AI inference.
Special statement: the content above (including any pictures or videos) was uploaded and posted by a user of NetEase Hao, a social media platform that only provides information storage services.