Why is it difficult to train large models?

Speaking from personal experience: the hard parts of training a large model are collecting the data, finding the technical path, and evaluating the generated results.

Collecting data

First of all, the data required to train a large model is massive, and there are a few common ways to get it:

  • Raw data from your own business (never enough)
  • Partner data (essentially hoping to get it for free)
  • Purchased data (sky-high prices, and sometimes there is a quoted price but nothing actually for sale)
  • Community-sourced data collection (quality varies widely)
  • Synthetic data generated in bulk (not yet practical enough)

All of this data needs manual cleaning and extra annotation, and at this volume you need dedicated annotation tools. How you understand and balance the distribution across the different data sources directly affects the final results.
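Balancing the mix across sources, as described above, is often done by weighted sampling rather than training on each source in proportion to its raw size. A minimal sketch, with made-up source names, sizes, and weights purely for illustration:

```python
import random

# Hypothetical training mix: names, sizes, and weights are illustrative only.
sources = {
    "own_business": {"size": 2_000_000, "weight": 0.5},
    "partner":      {"size": 8_000_000, "weight": 0.3},
    "purchased":    {"size": 1_000_000, "weight": 0.2},
}

def sample_source(sources, rng=random):
    """Pick a source in proportion to its mixing weight (not its raw size)."""
    names = list(sources)
    weights = [sources[n]["weight"] for n in names]
    return rng.choices(names, weights=weights, k=1)[0]

def oversampling_factor(sources, name):
    """How many times more often a source is seen than its natural share."""
    total = sum(v["size"] for v in sources.values())
    natural_share = sources[name]["size"] / total
    return sources[name]["weight"] / natural_share
```

Upweighting a small but high-quality source means its examples repeat more often per epoch, which is exactly the kind of distribution decision the paragraph above says will shape the final model.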

Finding the technical path

Many onlookers think training a large model is like cooking: given the ingredients and a recipe, a few turns of the spatula produce a finished dish. In reality, training is more like hunting for treasure in a jungle, where top talent is the compass and compute is the shovel. You have to keep trying, correcting mistakes, and searching one branching path after another for the optimal combination of algorithms and data.

Facts have shown that even a team as strong as Google sometimes bets on the wrong direction. For example, after OpenAI's Sora came out, other text-to-video teams suddenly realized that the DiT architecture was the one to benchmark against.

Because large models only exhibit emergent abilities past a certain scale, you can't run cheap experiments on small models, and once full-scale training starts, a failure is crushingly expensive.

In my personal estimate, if you already know the general technical direction and train on the same data, you can save 60–90% of the training cost. Even so, plenty of details can still go wrong: data sources, model hyperparameter choices, and all sorts of odd processing tricks.

Evaluating the generated results

Never mind that every language model claims to top some leaderboard; even internally there is no particularly good quantitative way to evaluate a model. Many papers still rely on user studies and eyeballing the outputs (yes, I'm criticizing myself too). Last year Y Combinator featured a startup focused specifically on model-evaluation solutions; I don't know how it's doing now.
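When no trustworthy automatic metric exists, a common fallback behind those user studies is pairwise preference judgments aggregated into per-model win rates. A minimal sketch, where the model names, prompts, and the `judge` callback are all assumptions for illustration (in practice the judge would be a human rater or a stronger model):

```python
from itertools import combinations

def win_rates(models, prompts, judge):
    """Aggregate pairwise verdicts into a win rate per model.

    judge(prompt, a, b) must return "a", "b", or "tie" for models a vs b.
    """
    wins = {m: 0 for m in models}
    games = {m: 0 for m in models}
    for prompt in prompts:
        for a, b in combinations(models, 2):
            verdict = judge(prompt, a, b)
            games[a] += 1
            games[b] += 1
            if verdict == "a":
                wins[a] += 1
            elif verdict == "b":
                wins[b] += 1
    # Ties count as a played game for both sides but a win for neither.
    return {m: wins[m] / games[m] for m in models}

# Toy judge that always prefers the lexicographically earlier model name.
rates = win_rates(["model_a", "model_b"], ["p1", "p2"],
                  lambda p, a, b: "a" if a < b else "b")
```

This is crude compared to a real evaluation pipeline, but it shows why such comparisons stay qualitative: the result is only as good as the judge.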

To sum up, training a large model is, as the legend goes: hard at the beginning, hard in the middle, and hard at the end.

Comment area

  1. Richie · March 1, 2024

    Wow, it's rare to see an article that explains the reasons so plainly.
