In today's IT world, if you don't know at least a little about AI, you risk being left behind by the times. But as an ordinary user, I suspect most people, like me, can only afford a gaming graphics card; spending tens of thousands on a dedicated compute card is simply unrealistic. So here's the question: if I use an RTX 4090 gaming card for AI, how does it actually differ from a compute card, and what can it realistically do?
01
Raw compute is comparable
The real gap lies in communication bandwidth and memory specs
Take the H100/A100 that NVIDIA launched a few years ago as an example. Although we call them compute cards, their raw compute advantage over gaming cards like the RTX 4090 is not especially dramatic. Part of the reason is that NVIDIA's spec sheets play some numbers games. The H100's Tensor FP16 compute, for instance, is listed as 1979 TFLOPS, but that is the sparse-compute figure, which is double the dense-compute figure.
So-called sparse compute relies on NVIDIA's structured sparsity feature: if a model has been pruned so that two of every four weights are zero (2:4 fine-grained structured sparsity), the Tensor Cores can skip the zeros and achieve double the nominal throughput. In other words, the sparse figure only applies to models specifically prepared for it.
Dense compute, in contrast, is the throughput when every element actually gets computed, with no sparsity assumption. This is what ordinary models and workloads get, so it is the figure that reflects real-world performance.
For typical AI workloads, dense compute is what matters, so the H100's genuinely usable Tensor FP16 figure is 989 TFLOPS. Coincidentally, NVIDIA's official marketing for the RTX 4090 touts 1321 TFLOPS of Tensor Core compute, but that is the INT8 figure; its FP16 compute is only about 330 TFLOPS. Even so, that already edges out the A100's 312 TFLOPS, so the gap in raw compute is not as large as you might expect.
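To put those dense-compute numbers in perspective, here is a quick back-of-the-envelope sketch: the ideal time for one large FP16 matrix multiply at each card's dense Tensor peak. (The TFLOPS values are the dense Tensor FP16 figures quoted above; real workloads only reach a fraction of peak, so treat these as lower bounds.)

```python
# Theoretical time for one FP16 GEMM at each card's dense Tensor peak.
# Multiplying an (m x k) matrix by a (k x n) matrix costs ~2*m*k*n FLOPs.

def gemm_seconds(m: int, k: int, n: int, tflops: float) -> float:
    """Ideal execution time assuming 100% of peak throughput."""
    flops = 2 * m * k * n
    return flops / (tflops * 1e12)

dense_fp16 = {"H100": 989, "A100": 312, "RTX 4090": 330}  # TFLOPS, from the table

m = k = n = 8192  # a large square matmul, typical of transformer layers
for card, tflops in dense_fp16.items():
    print(f"{card:8s}: {gemm_seconds(m, k, n, tflops) * 1e3:.2f} ms")
```

On paper, the RTX 4090 even finishes this slightly faster than the A100, which is exactly the point: raw compute alone does not separate gaming cards from compute cards.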
Spec comparison: compute cards vs. a gaming card

| Spec | H100 | A100 | RTX 4090 |
| --- | --- | --- | --- |
| Tensor FP16 dense compute | 989 TFLOPS | 312 TFLOPS | 330 TFLOPS |
| Tensor TF32 dense compute | 495 TFLOPS | 156 TFLOPS | 83 TFLOPS |
| Memory capacity | 80GB HBM3 | 80GB HBM2e | 24GB GDDR6X |
| Memory bandwidth | 3.35TB/s | 2TB/s | 1TB/s |
| Interconnect bandwidth | 900GB/s (SXM) | 600GB/s (SXM) | 64GB/s (PCIe 4.0) |
What really opens up the gap is the communication bandwidth and memory specs of compute cards like the H100/A100. NVIDIA's compute cards can forgo the PCIe interface in favor of the dedicated SXM form factor and interconnect multiple cards via NVLink, pushing inter-card bandwidth to an astonishing 900GB/s on the H100. The RTX 4090 can only use PCIe, and NVLink support has been removed from the product line entirely, so its current ceiling is PCIe 4.0 x16's 64GB/s.
On the memory side, the compute cards carry 80GB of HBM with bandwidth up to 3.35TB/s on the H100, while the RTX 4090's 24GB of GDDR6X delivers only about 1TB/s.
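To see why interconnect bandwidth matters so much, consider how long it takes just to move a large model's weights between cards. A minimal sketch, using the bandwidth figures from the table above and assuming a 70B-parameter model stored in FP16 (2 bytes per parameter):

```python
# Time to shuffle a large model's weights over each interconnect.
# Bandwidths are from the spec table; the 70B / FP16 model is an
# illustrative assumption, not a measurement.

def transfer_seconds(gigabytes: float, gb_per_s: float) -> float:
    return gigabytes / gb_per_s

weights_gb = 70e9 * 2 / 1e9  # 70B params in FP16 = 140 GB

links = {"NVLink (H100 SXM)": 900, "PCIe 4.0 x16 (RTX 4090)": 64}
for name, bw in links.items():
    print(f"{name}: {transfer_seconds(weights_gb, bw):.2f} s")
```

The PCIe path is roughly 14x slower, and multi-GPU training repeats this kind of traffic constantly, which is why the interconnect gap dominates the raw-compute gap.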
02
Gaming cards can't train AI
but they can run inference
Jensen Huang's precise "knife work" in product segmentation has always been a hot topic among enthusiasts: with professional GPUs carrying much higher margins, strict feature gating between product lines is essential. Technically speaking, large-model training demands high-performance inter-GPU communication, and gaming cards, even the flagship RTX 4090, have exactly that capability cut, because training AI almost always means running a GPU cluster.
Take Meta AI's open-source LLaMA-2-70B model as an example. On a single A100, one training run would take about 1.7 million GPU-hours; to finish within a month you would need at least 2,400 A100s. Gaming cards are not designed for clustering the way professional compute cards are: even if you had more than 2,000 RTX 4090s, you could not interconnect them effectively. On top of that, gaming cards are not licensed for data center deployment, so they cannot substitute for compute cards at all here.
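The cluster-sizing estimate in that paragraph is simple arithmetic, sketched below with the article's own figures (1.7 million GPU-hours, a 30-day month):

```python
import math

# Cluster sizing: how many A100s to finish ~1.7M GPU-hours in one month?
gpu_hours = 1_700_000
hours_per_month = 30 * 24  # 720 hours

gpus_needed = math.ceil(gpu_hours / hours_per_month)
print(gpus_needed)  # ~2362, hence "at least 2400 A100s" once you add headroom
```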
AI training requires many graphics cards computing in parallel, and gaming cards are "congenitally deficient" in this respect
In addition, AI training needs massive amounts of data held in video memory. A single compute card with 80GB simply outclasses a gaming card on this spec; you would need several gaming cards just to match the capacity. Compute-card memory also supports ECC error correction, which meaningfully reduces failure rates, and a low failure rate is the foundation of sustained compute output.
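To illustrate just how far training memory requirements exceed a gaming card's 24GB, here is a rough footprint estimate using a common rule of thumb for mixed-precision Adam training (about 16 bytes per parameter for weights, gradients, and optimizer state, ignoring activations entirely; the per-component breakdown is an assumption, not from the article):

```python
# Rough training-memory footprint per parameter for mixed-precision Adam:
# FP16 weights (2B) + FP16 grads (2B) + FP32 master weights (4B)
# + two FP32 Adam moment buffers (8B) = 16 bytes/param, before activations.
BYTES_PER_PARAM_TRAINING = 16

def training_vram_gb(params: float) -> float:
    return params * BYTES_PER_PARAM_TRAINING / 1e9

for name, p in {"7B": 7e9, "70B": 70e9}.items():
    gb = training_vram_gb(p)
    print(f"{name}: ~{gb:.0f} GB of state -> {gb / 80:.0f}+ 80GB compute cards")
```

Even a 7B model's training state alone dwarfs 24GB of VRAM, before a single activation is stored.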
So if gaming cards cannot train AI, what can they do? Anyone who has used Stable Diffusion knows that when running local text-to-image applications, the efficiency advantage of a high-performance gaming card is very obvious. In other words, used correctly, gaming cards are good at AI inference.
The reason we stress "used correctly" is that VRAM capacity easily becomes the bottleneck. In current AI inference, whether using pipeline parallelism or tensor parallelism, memory bandwidth can limit efficiency, and on top of that both the model weights and the caches generated during computation must fit in VRAM. That is why many local AI applications ask users to configure VRAM usage in advance to maximize efficiency, and it is also why the RTX 4090, as a large-VRAM gaming card, is particularly well suited to running local AI inference.
Special statement: the content above (including any pictures or videos) was uploaded and posted by a user of NetEase Hao, a social media platform that only provides information storage services.