Model Quantization and Its Application in LLM | Dewu Technology


I Model Inference Optimization

As models are deployed in more and more scenarios, inference acceleration has become an important part of AI engineering. In recent years, large models based on the Transformer architecture have become mainstream and achieved SoTA results on a wide range of tasks. Since they are expensive both to train and to serve, deploying them at a reasonable cost has become increasingly important.

The challenges of large-model inference mainly include the following two points:

  • The huge demand for memory (GPU memory), which comes mainly from the model parameters themselves and from the state needed at inference time.
    • For a LLaMA2-30B-class model, simply loading the weights into GPU memory takes about 60 GiB. During inference, the KV cache of a single token takes about 1.6 MiB: 6656 (hidden dim) × 60 (layer num) × 2 (K & V) × 2 (fp16, 2 bytes); a 2048-token request therefore needs about 3.3 GiB of GPU memory (see the back-of-the-envelope sketch after this list).
  • Poor parallelism: generation is inherently sequential in time, which makes the decoding process hard to parallelize and turns it into a compute bottleneck.
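As a quick sanity check of these numbers, a back-of-the-envelope calculation in Python (assuming a hidden dimension of 6656 and 60 Transformer layers, as used above) roughly reproduces them:

# Rough KV-cache size estimate for a LLaMA-30B-class model (dimensions assumed from the text above).
hidden_dim = 6656        # per-layer hidden dimension
num_layers = 60          # number of Transformer layers
bytes_per_elem = 2       # fp16
kv_factor = 2            # one K and one V vector per layer

bytes_per_token = hidden_dim * num_layers * kv_factor * bytes_per_elem
print(f"KV cache per token      : {bytes_per_token / 2**20:.2f} MiB")          # ~1.5 MiB
print(f"KV cache for 2048 tokens: {2048 * bytes_per_token / 2**30:.2f} GiB")   # ~3.0 GiB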

Common inference optimization methods include Knowledge Distillation (KD), Pruning and Quantization, as well as various schemes proposed for LLM memory optimization (such as FlashAttention, PagedAttention, etc.).

Distillation directly constructs a small model as a student model and supervises it with a combination of the original model's soft labels and the ground-truth labels, so that the small model approaches the performance of the original model; the large model is then replaced by the small one to improve inference efficiency.

 

[Image source: Knowledge Distillation: A Survey, 2021, p2]

Pruning "slims down" the model by cutting away unimportant weights, improving inference efficiency. To preserve model quality, pruning usually has to be accompanied by fine-tuning on training data. Depending on the granularity of the pruned weights, it can be divided into structured pruning and unstructured pruning.

  • Structured pruning: unimportant channels are pruned in blocks along one or more dimensions of the weight tensor, so ordinary matrix multiplication is preserved; however, because the removed channels affect the layers before and after them, the logical correctness of the network needs to be checked.
  • Unstructured pruning: unimportant individual elements of the weight tensor are pruned, so the original weight shape is kept but the resulting multiplication becomes sparse; this is unfriendly to general-purpose hardware, and dedicated hardware is needed to actually obtain a speedup.

At present, pruning is rarely used for LLMs. For example, the activation-aware pruning work below [1] performs unstructured pruning based mainly on the absolute values of the weights themselves and of the input tensor, which makes the weight tensor sparse; its accuracy loss, however, does not yet meet engineering requirements.

 

[Image source: A Simple and Effective Pruning Approach for Large Language Models, 2023, p2]

The recent structured pruning work [2] shown in the figure below searches for substructures within the model and maintains accuracy through retraining. The pruned model loses considerable accuracy compared with the original one, and can only be compared against other smaller models with the same (post-pruning) parameter count to demonstrate the value of the method.

 

[Image source: Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning, 2023, p3]

 

[Image source: huggingface/Sheared-llama-1.3B]

The main advantages of quantization, as the method of choice for neural networks and LLMs, are as follows:

  • Lower GPU memory footprint.
    • LLM weights are usually stored in FP16; after converting the weights to int4, the size intuitively drops to 1/4 of the original (in practice slightly more, because embeddings are often left unquantized, plus memory-allocation overhead and other factors), which greatly reduces the GPU memory requirement.
  • Faster inference with dedicated low-bit multiplication kernels (W4A16, W8A16, etc.).

II Introduction to Quantization

Basics

The essence of quantization is to convert the model's parameters, or the whole inference computation, from floating point to integer.

Quantization parameters usually consist of a scale and a zero point; the former is a floating-point value and the latter an integer. Let x be a tensor (it can be a weight or an intermediate activation of inference); its quantization process can be expressed as follows:

x_int = clamp(round(x / s) + z; q_min, q_max)

Here b denotes the quantization bit width, and q_min and q_max denote the bounds of the integer range. For example, int8 quantization can use the range [-128, 127], i.e. q_min = -2^(b-1) = -128 and q_max = 2^(b-1) - 1 = 127; clamp(a; q_min, q_max) truncates the input value a to the range [q_min, q_max]; x_int is the quantized result; s and z are the quantization parameters scale and zero point.

 

 

[Image source: A Survey of Quantization Methods for Efficient Neural Network Inference, 2021, p5; An Introduction to Quantization of Large Language Models, p12]

The dequantization process from integer back to floating point is as follows:

x_hat = s · (x_int − z)

For the quantization parameters, there are many algorithms based on search, optimization, LKD (layer-by-layer distillation) and so on that compute an optimal solution minimizing the accuracy loss caused by quantization; the most direct way to compute scale and zero point is simply from the tensor's element-wise min/max.

 

The following simple code shows an example of quantizing a tensor x from fp32 to int8 and then back to fp32:

An example of the process x → x_int → x_hat is as follows:
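A minimal NumPy sketch of this round trip, with scale and zero point derived from the tensor's min/max as described above (the helper names are illustrative):

import numpy as np

def quantize(x, bits=8):
    # Asymmetric quantization: scale and zero point come from the tensor's min/max.
    qmin, qmax = -2 ** (bits - 1), 2 ** (bits - 1) - 1
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = int(round(qmin - x.min() / scale))
    x_int = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)
    return x_int, scale, zero_point

def dequantize(x_int, scale, zero_point):
    # x_hat = s * (x_int - z)
    return scale * (x_int.astype(np.float32) - zero_point)

x = np.random.randn(4, 4).astype(np.float32)
x_int, s, z = quantize(x)
x_hat = dequantize(x_int, s, z)
print("x before quantization:\n", x)
print("x_hat after dequantization:\n", x_hat)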

 

x before quantization:

 

x_hat after quantization and dequantization:

 

Symmetry/Asymmetry

Compared with asymmetric quantization, symmetric quantization is defined by an integer range that is symmetric around zero, i.e. the zero point in the formula above is 0 and q_max = -q_min, which simplifies the quantization expression.

Asymmetric quantization helps make full use of the quantization range. For example, the activation tensor output by Conv+ReLU contains only positive values; with symmetric quantization, all floating-point values would be mapped into [0, 127], leaving half of the range unused, so its quantization precision is worse than with asymmetric quantization.

 

[Image source: A Survey of Quantization Methods for Efficient Neural Network Inference, 2021, p5]

In practice, the weight tensor is often quantized symmetrically while the input tensor is quantized asymmetrically. The following analysis comes from Qualcomm's quantization white paper [3]. When asymmetric quantization is chosen for both the weights and the inputs, taking the matrix multiplication of a Linear layer as an example, the expression expands as follows:

W_hat · x_hat = s_W (W_int − z_W) · s_x (x_int − z_x)
              = s_W s_x W_int x_int − s_W s_x z_W x_int − s_W s_x z_x W_int + s_W s_x z_W z_x

  • The first term is the integer-tensor multiplication, which is necessary and must be computed at inference time;
  • The third and fourth terms involve only scales, zero points and the integer weights, which are all known in advance, so they can be precomputed and folded in as an offset;
  • The second term depends on x_int and therefore has to be computed for every inference, which costs extra compute.

Therefore, if the weight quantization is made symmetric (z_W = 0), the expression simplifies as follows; at run time only the matrix multiplication of the first term has to be computed, while the second term is a precomputed offset:

W_hat · x_hat = s_W s_x W_int x_int − s_W s_x z_x W_int

When both weights and inputs are quantized symmetrically, the expression simplifies further:

W_hat · x_hat = s_W s_x W_int x_int

Compared with the floating-point computation Wx in the original model, W_int x_int is an integer-integer multiplication, which runs much faster on NVIDIA GPUs; this is the reason quantized models can be served so much faster.
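As a quick numerical illustration of the formulas above (a per-tensor symmetric-quantization sketch, not a performance benchmark), one can check that the integer product rescaled by s_W · s_x closely approximates the floating-point product:

import numpy as np

def sym_quant(t, bits=8):
    # Per-tensor symmetric quantization: zero point is 0, scale from the max absolute value.
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(t).max() / qmax
    t_int = np.clip(np.round(t / scale), -qmax - 1, qmax).astype(np.int32)
    return t_int, scale

W = np.random.randn(8, 16).astype(np.float32)
x = np.random.randn(16, 4).astype(np.float32)
W_int, s_w = sym_quant(W)
x_int, s_x = sym_quant(x)

y_fp = W @ x                         # floating-point reference
y_q = s_w * s_x * (W_int @ x_int)    # integer matmul, rescaled afterwards
print("max abs error:", np.abs(y_fp - y_q).max())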

III Quantization of LLM

Challenges in LLM Quantization

From the model-performance point of view, a precondition for quantization is preserving the accuracy of the quantized model, i.e. the model's users should feel that the quantized model keeps its original quality while gaining inference efficiency.

The operations that need to be quantized in a neural network are mainly the convolution layer Conv(x; W) and the fully connected layer Wx, i.e. the weight quantization (WQ) of W and the activation quantization (AQ) of x, following the procedure described in the previous section.

Unlike CNN models or small Transformer models, the activation tensors produced by matrix multiplications in Transformer-based large models usually contain more outliers, i.e. values that lie far away from the cluster formed by most points of the distribution. These elements with large absolute values but low frequency increase the difficulty of quantization, and deciding how to treat the outliers is usually the hard part of quantization work: if too many of them are kept, the quantization range becomes too wide and the effective resolution of quantization shrinks; if they are truncated too aggressively, the large-magnitude values, which strongly influence the inference results, are distorted and model quality degrades. The latter effect is especially pronounced in LLMs.

The figure below shows element-value statistics of the input tensors of ResNet18 and OPT-13B at one particular layer, where sigma denotes the standard deviation of each distribution. The maximum input value of ResNet18 is about 28 sigma, and only 0.05% of the values exceed 6 sigma in absolute value; the maximum input value of OPT-13B reaches 325 sigma, and 0.2% of the values exceed 6 sigma in absolute value. In terms of quantization results, the int8 ResNet18 loses essentially no accuracy, while the accuracy of the int8 OPT-13B model collapses.

 

[Source: An Introduction to Quantization of Large Language Models, p20]

To address the difficulty of activation quantization, some schemes try to lower the quantization difficulty of the activation tensor; SmoothQuant is a representative idea.

 

 

[Image source: SmoothQuant, p4]

In matrix multiplication, SmoothQuant scales down the values of the input tensor X and compensates the weight tensor W by the same proportion, i.e. it turns the problem of quantizing X and W into quantizing X · diag(s^(-1)) and diag(s) · W. The product of the multiplication stays unchanged while the quantization difficulty of the tensor X is reduced. In practical engineering, however, the quantization error of this scheme still has a noticeable impact on the inference quality of large models, with visible error even at int8 precision. For example, the following results of applying SmoothQuant to Llama2-7B show that its perplexity is quite poor, making it hard to use in practice.
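The equivalence of this rescaling is easy to check numerically; a toy NumPy sketch (the smoothing factor below is a simple assumption for illustration, not the exact formula from the paper):

import numpy as np

np.random.seed(0)
X = np.random.randn(4, 8)
X[:, 3] *= 50.0                    # inject an outlier channel into the activation
W = np.random.randn(8, 6)

s = np.abs(X).max(axis=0) ** 0.5   # per-channel smoothing factor (assumed here)
X_s = X / s                        # X · diag(s^-1)
W_s = W * s[:, None]               # diag(s) · W

print("product unchanged:", np.allclose(X @ W, X_s @ W_s))
print("activation max |value| before:", np.abs(X).max(), "after:", np.abs(X_s).max())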

 

Therefore, most of the schemes actually deployed in engineering today are weight-only quantization schemes, i.e. quantization of the activations is given up.

GPTQ

GPTQ is the first quantization scheme that gained wide acceptance in engineering deployments: its W8A16 or W4A16 quantized models behave close to the original model in most scenarios, and the quantization process itself is fast.

Quantization process

Taking the basic matrix-multiplication unit as an example, and using the mean squared error of the product before and after weight-only quantization, the optimization objective can be written as:

argmin_{W_hat} ‖W X − W_hat X‖²

W is a linear-layer weight in the Transformer and X is its corresponding input. Offline quantization proceeds module by module (Transformer block by block) and layer by layer within each module (Q, K, V, O, Fc1, Fc2).

Parameters and data are defined as follows:

  • W ∈ R^{K×M}, X ∈ R^{M×N}, Y = W × X ∈ R^{K×N}
  • Calibration set: a small amount of data run through the model, used to observe the value range of each layer's input tensor and to quantize based on it.

The specific quantization process is as follows (a simplified code sketch follows the list):

  • Compute the Hessian (the Hessian of the optimization objective above with respect to W_hat, not the Hessian used in back-propagation) and add a damping term:

H = 2 X X^T (with a small damping term added to its diagonal for numerical stability)

  • Act-order sorting (desc_act; columns with similar value ranges are quantized together): the columns of W are reordered along the M dimension according to diag(H), and H is permuted correspondingly in both dimensions.
  • Compute the inverse H^(-1) (via Cholesky decomposition).
  • Quantize W block by block from left to right along dimension M with block size B = 128, and update the not-yet-quantized part to the right based on H^(-1) to compensate for the quantization loss.

 

  • (inner loop) Quantize each block column by column, compute the quantization error, and update the not-yet-quantized columns inside the block based on that error.

 

 

  • (outer loop) After a block is finished, update all of the columns that follow it:
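A heavily simplified NumPy sketch of the loop structure above (per-row symmetric quantization, no act-order reordering; meant as an illustration, not as the reference GPTQ implementation):

import numpy as np

def gptq_like_quantize(W, X, bits=4, block_size=128, damp=0.01):
    # W: (K, M) weights, X: (M, N) calibration inputs.
    K, M = W.shape
    H = 2.0 * X @ X.T                                  # Hessian of the layer-wise objective
    H += damp * np.mean(np.diag(H)) * np.eye(M)        # damping term for numerical stability
    Hinv = np.linalg.cholesky(np.linalg.inv(H)).T      # upper-triangular Cholesky factor of H^-1

    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax  # one scale per row (g = -1)
    W = W.copy()
    Q = np.zeros_like(W)

    for b in range(0, M, block_size):                  # outer loop over column blocks
        e = min(b + block_size, M)
        Err = np.zeros((K, e - b))
        for j in range(b, e):                          # inner loop over columns of the block
            q = np.clip(np.round(W[:, j] / scale[:, 0]), -qmax - 1, qmax)
            Q[:, j] = q * scale[:, 0]                  # quantize-dequantize the column
            err = (W[:, j] - Q[:, j]) / Hinv[j, j]
            W[:, j + 1:e] -= np.outer(err, Hinv[j, j + 1:e])   # compensate the rest of the block
            Err[:, j - b] = err
        W[:, e:] -= Err @ Hinv[b:e, e:]                # propagate the block's error to later columns
    return Q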

 

group_size

  • If group size is not specified, the default is g = -1: the quantization parameters are computed over all columns, and each row of the weight is quantized with its own parameters. For W ∈ R^{K×M}, the number of quantization parameters is K × 1.

 

  • If group size is specified, e.g. g = 128, the quantization parameters are computed over every 128 columns, and each row is quantized group by group. For W ∈ R^{K×M}, the number of quantization parameters is K × (M/g) (see the sketch below).
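A small sketch of how the group size changes the number of scale parameters (symmetric per-group scales assumed; M is taken to be divisible by g):

import numpy as np

def per_group_scales(W, g=-1, bits=4):
    # One symmetric scale for every g consecutive columns of each row;
    # with g = -1 a single scale covers all M columns of the row.
    K, M = W.shape
    if g == -1:
        g = M
    qmax = 2 ** (bits - 1) - 1
    W_grouped = W.reshape(K, M // g, g)          # assumes M % g == 0
    return np.abs(W_grouped).max(axis=-1) / qmax

W = np.random.randn(4096, 4096).astype(np.float32)
print(per_group_scales(W, g=-1).shape)   # (4096, 1)   -> K x 1 parameters
print(per_group_scales(W, g=128).shape)  # (4096, 32)  -> K x (M / g) parameters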

 

Column reordering (desc_act)

Based on the Hessian matrix H, the columns of W are reordered along the M dimension according to diag(H). The purpose is to quantize first the weight columns that correspond to activations with large absolute values; these columns are considered more important to the result during inference, so they should incur smaller quantization error, while more of the error is pushed onto the later, less important columns.

Some experiments show that desc_act effectively reduces the quantization loss on most tasks.

 

Perplexity of Pygmalion-7B with GPTQ [7]

[Image source: https://huggingface.co/reeducator/vicuna-13b-free/discussions/22]

operator

Strictly speaking, weight-only W4A16 does not improve efficiency much compared with the original W16A16, and inference even adds a quantize/dequantize step. However, as weight-only quantization has become mainstream for LLMs and its applications keep growing, many open-source projects provide efficient W4A16 kernels to speed up inference for these quantization algorithms, such as AutoGPTQ, the Python package for GPTQ, which integrates the open-source tool exllama and rewrites the parallel computation of the quantized multiplication in Triton and CUDA. In exllama/exllama_ext/matrix.cuh one can see the dot_product8_h implementation of out = W_hat · x = (W_int − z) s · x = (W_int − z) x · s.

 

[Image source: https://github.com/turboderp/exllama/blob/3b013cd53c7d413cf99ca04c7c28dd5c95117c0d/exllama_ext/matrix.cuh#L86]

AWQ

Compared with GPTQ, whose design is optimization-based, AWQ is a search-based quantization scheme.

Using Q(·) to denote the quantize-dequantize process, the quantization before the modification is:

y = Q(w) · x,  with Q(w) = Δ · Round(w / Δ),  Δ = max(|w|) / 2^(N−1)

After the modification, a scaling of w is added, and the quantization process becomes:

y = Q(w · s) · (x / s) = Δ' · Round(w s / Δ') · x · (1 / s)

search

The full name of AWQ is Activation-aware Weight Quantization, i.e. the values of the activations are taken into account during weight quantization. The starting point is again that, among the channels of the weight, those whose corresponding activation values are larger are relatively more important, and the others less so; this importance is expressed by multiplying the channel by a scaling coefficient s, whose value and range are derived from the values of the input activation tensor:

s = s_X^α,  α* = argmin_α L(s_X^α)

The search criterion compares the output of the linear layer before and after quantization and takes the scale with the minimum MSE as the optimal solution:

L(s) = ‖Q(W · diag(s)) (diag(s)^(-1) · X) − W X‖
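A simplified NumPy sketch of this grid search over α (per-row symmetric quantize-dequantize assumed; an illustration rather than the reference implementation):

import numpy as np

def awq_like_search(W, X, bits=4, n_grid=20):
    # W: (K, M) weights, X: (M, N) calibration activations.
    qmax = 2 ** (bits - 1) - 1

    def quant_dequant(w):
        scale = np.abs(w).max(axis=1, keepdims=True) / qmax    # per-row symmetric quantization
        return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

    s_x = np.abs(X).mean(axis=1)            # per-channel average activation magnitude
    y_ref = W @ X
    best_s, best_mse = None, np.inf
    for alpha in np.linspace(0.0, 1.0, n_grid):
        s = np.clip(s_x ** alpha, 1e-4, None)
        y = quant_dequant(W * s) @ (X / s[:, None])   # Q(W·diag(s)) · diag(s)^-1 · X
        mse = np.mean((y - y_ref) ** 2)
        if mse < best_mse:
            best_s, best_mse = s, mse
    return best_s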

 

effect

In terms of model quality, the optimal scaling coefficients are found by a layer-by-layer scale search, yielding the solution with the smallest quantization error. The comparison below from the AWQ paper shows that, measured by perplexity, AWQ's quantization of both generations of Llama is slightly better than GPTQ and its reordered (act-order) variant.

 

[Image source: AWQ, p6]

Measured by accuracy on actual tasks, AWQ is comparable to the act_order variant of GPTQ (GPTQ-R), while being faster.

 

[Image source: AWQ, p5]

In terms of compute performance, GPTQ involves a reorder operation and its matrix multiplication is MV (matrix × vector) with discontinuous memory access, while AWQ has no reorder operation and its matrix multiplication is (matrix × matrix), which is faster.

IV Summary

The current SOTA results in LLM quantization are basically obtained with the weight-only quantization mode; its main contribution is reducing the GPU memory needed to run the model.

In terms of model quality, quantization losses are unavoidable, and LLMs are usually much more sensitive to quantization than traditional CNN models. Although the quantized LLM differs little from the unquantized one on many tasks, it may still fall short on some of them.

In terms of acceleration, the low-level speedups of weight-only quantization basically come from W4A16, W3A16, W8A16 and similar multiplication kernels. Judging from the theoretical numbers reported in the papers, they are usually only 1.x~3.x times faster than the FP16 model, the actual deployed speedup may be lower, and this is far less than the acceleration of fully integer kernels such as W4A4, W8A8, etc.

Overall, quantization work in the LLM field is still quite preliminary. If a task requires very high model accuracy, it is recommended to rely instead on algorithms and tools that improve throughput per unit of GPU memory purely through KV-cache and related optimizations, such as FlashAttention-2, PagedAttention, etc.

V References

1. A Simple and Effective Pruning Approach for Large Language Models, 2023.

2. Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning, 2023.

3. A White Paper on Neural Network Quantization, 2021.

4. SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models, 2023.

5. GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers, 2023.

6. AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration, 2023.

7. Some evaluation on GPTQ performance.

 

Article by: xujiong

This article is an original work of Dewu Technology. For more articles, please visit the official Dewu Technology website.

Reprinting without permission from Dewu Technology is strictly prohibited; otherwise legal liability will be pursued according to law.
