from torchao.quantization.quant_api import quantize, int4_weight_only
m = quantize(m, int4_weight_only())
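For example, here is a minimal sketch (the toy model and sizes are illustrative) showing that a quantized model still composes with torch.compile:

```python
import torch
import torch.nn as nn
from torchao.quantization.quant_api import quantize, int4_weight_only

# toy model; int4 weight-only quantization targets the nn.Linear weights
m = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 1024))
m = m.to(torch.bfloat16).cuda()  # the int4 tinygemm kernel expects bfloat16 on CUDA
m = quantize(m, int4_weight_only())

# the quantized model composes with torch.compile
m = torch.compile(m, mode="max-autotune")
out = m(torch.randn(1, 1024, dtype=torch.bfloat16, device="cuda"))
```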
- 9.5x speedups for image segmentation models with sam-fast compared to vanilla sam.
- 1.16x speedup when composing int8 quantization with 2:4 sparsity, against an already accelerated baseline of bfloat16 dtype and torch.compile(mode="max-autotune") (see the sketch below).
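To illustrate the 2:4 half of that composition, here is a minimal sketch using PyTorch's built-in semi-structured sparsity support; the magnitude-based pruning step is illustrative, not torchao's pruning recipe:

```python
import torch
from torch.sparse import to_sparse_semi_structured

linear = torch.nn.Linear(4096, 4096).half().cuda()

# illustrative 2:4 pruning: keep the 2 largest-magnitude weights in every group of 4
w = linear.weight.detach()
groups = w.view(-1, 4)
keep = torch.zeros_like(groups, dtype=torch.bool)
keep.scatter_(1, groups.abs().topk(2, dim=1).indices, True)

# store the pruned weight in the accelerated 2:4 sparse format
linear.weight = torch.nn.Parameter(to_sparse_semi_structured(w * keep.view_as(w)))
out = linear(torch.randn(128, 4096, dtype=torch.half, device="cuda"))
```

The 1.16x number above comes from additionally applying int8 quantization on top of a model in this sparse format via torchao's APIs.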
from torchao.sparsity.training import SemiSparseLinear, swap_linear_with_semi_sparse_linear

swap_linear_with_semi_sparse_linear(model, {"seq.0": SemiSparseLinear})
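For context, the keys in the config dict are fully qualified module names; here is a minimal sketch (the ToyModel below is illustrative):

```python
import torch
import torch.nn as nn
from torchao.sparsity.training import SemiSparseLinear, swap_linear_with_semi_sparse_linear

class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        # "seq.0" in the config below names the first Linear in this Sequential
        self.seq = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))

    def forward(self, x):
        return self.seq(x)

model = ToyModel().half().cuda()  # runtime 2:4 sparsity targets fp16/bf16 on CUDA
swap_linear_with_semi_sparse_linear(model, {"seq.0": SemiSparseLinear})

# train as usual; the swapped layer sparsifies its weight to 2:4 on the fly
out = model(torch.randn(64, 1024, dtype=torch.half, device="cuda"))
out.sum().backward()
```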
- MX: training and inference support with tensors using the OCP MX spec data types, which can be described as groupwise-scaled float8/float6/float4/int8, with the scales constrained to powers of two. This work is a prototype, as native hardware support is not available yet.
- nf4, which was used to implement QLoRA, one of the most popular finetuning algorithms, without writing custom Triton or CUDA code (see the sketch after this list). Accessible talk here.
- fp6, for 2x faster inference over fp16 with an easy to use API: quantize(model, fp6_llm_weight_only())
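As a minimal sketch of the nf4 tensor (the block sizes shown are common QLoRA-style defaults; treat the exact values as an assumption):

```python
import torch
from torchao.dtypes import to_nf4
from torchao.dtypes.nf4tensor import linear_nf4

# quantize a bf16 weight to nf4 (4-bit normal-float with blockwise scales)
weight = torch.randn(1024, 1024, dtype=torch.bfloat16, device="cuda")
nf4_weight = to_nf4(weight, block_size=64, scaler_block_size=256)

# linear_nf4 dequantizes on the fly, as in QLoRA-style finetuning
x = torch.randn(8, 1024, dtype=torch.bfloat16, device="cuda")
out = linear_nf4(x, nf4_weight)
```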
- Write the dtype, layout, or bit-packing logic in pure PyTorch and code-generate efficient kernels with torch.compile. You can inspect those kernels with TORCH_LOGS="output_code" python your_code.py and check whether a single kernel is being generated and whether any unnecessary buffers are being created (see the first sketch after this list).
- However, once you have a kernel, how do you know how good it is? The best way is to benchmark the compiler-generated code against the best kernel on the market. Packaging custom CPP/CUDA kernels that work on multiple devices is tedious, but we've abstracted all of that tedium away with our custom ops support, so if you love writing kernels but hate packaging, we'd love to accept contributions for your custom ops (see the second sketch after this list). One key benefit is that a kernel written as a custom op will just work with no graph breaks with torch.compile(). Compilers are great at optimizations like fusion and overhead reduction, but it's challenging for a compiler to rewrite the math of an algorithm so that it's both faster and numerically stable, so we are betting on both compilers and custom ops.
- Finally, while historically most quantization has been done for inference, there is now a thriving area of research combining distributed algorithms and quantization. One popular example is NF4, which was used to implement the QLoRA algorithm. The NF4 tensor also contains semantics for how it should be sharded over multiple devices, so it composes with FSDP. We gave an accessible talk on how to do this.
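To make the first bullet above concrete, here is a minimal sketch of bit-packing logic written in pure PyTorch (pack_uint4 is an illustrative helper, not a torchao API):

```python
import torch

def pack_uint4(x: torch.Tensor) -> torch.Tensor:
    # pack pairs of 4-bit values (stored one-per-byte in uint8) into single bytes
    x = x.contiguous().view(-1)
    return (x[::2] << 4) | (x[1::2] & 0xF)

# compile it, then inspect the generated kernel with:
#   TORCH_LOGS="output_code" python your_code.py
packed = torch.compile(pack_uint4)(torch.randint(0, 16, (4096,), dtype=torch.uint8, device="cuda"))
```

And for the second bullet, a sketch of what wrapping a kernel as a custom op can look like with the torch.library API on recent PyTorch (2.4+); the op name and Python body are illustrative stand-ins for a real CPP/CUDA kernel:

```python
import torch

@torch.library.custom_op("mylib::scaled_add", mutates_args=())
def scaled_add(a: torch.Tensor, b: torch.Tensor, scale: float) -> torch.Tensor:
    # a real contribution would dispatch to a hand-written CPP/CUDA kernel here
    return a + scale * b

@scaled_add.register_fake
def _(a: torch.Tensor, b: torch.Tensor, scale: float) -> torch.Tensor:
    # shape/dtype propagation so torch.compile can trace through without graph breaks
    return torch.empty_like(a)
```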
# latest stable PyTorch
pip install torch

# or a PyTorch nightly
pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu121

# stable torchao release from PyPI
pip install torchao

# stable torchao release from the PyTorch index
pip install torchao --extra-index-url https://download.pytorch.org/whl/cu121 # full options are cpu/cu118/cu121/cu124

# torchao nightly release
pip install --pre torchao-nightly --index-url https://download.pytorch.org/whl/nightly/cu121 # full options are cpu/cu118/cu121/cu124

# from source
git clone https://github.com/pytorch/ao
cd ao
python setup.py install
- jeromeku has implemented:
  - GaLore, a drop-in replacement for the Adam optimizer that lets you finetune Llama 7B on a single 4090 card, with up to 70% speedups relative to eager PyTorch
  - DoRA, a newer replacement for QLoRA with more promising convergence characteristics
  - Fused int4/fp16 quant matmul, which is particularly useful for compute-bound workloads, showing 4x speedups over tinygemm for larger batch sizes such as 512
- gau-nernst with fp6 kernels that are 4x faster than fp16 (torchao/prototype/quant_llm)
- vayuda with generic bitpacking kernels that were code-generated using pure PyTorch (prototype/common)
- andreaskoepf and melvinebenezer with 1-bit LLMs: Bitnet 1.58 bitpacked into uint2 and fully code-generated with torch.compile
- Accelerating Neural Network Training with Semi-Structured (2:4) Sparsity
- https://mobiusml.github.io/whisper-static-cache-blog/
- Slaying OOMs at the Mastering LLMs course
- Advanced Quantization at CUDA MODE
- Chip Huyen's GPU Optimization Workshop
- If you have suggestions on the API or use cases you'd like covered, please open an issue.
- If you'd like to co-develop the library with us, please join us on #torchao on discord.gg/cudamode - there are a lot of dtypes out there, and we could use a lot more hands to make them go brrr.
pip install -r dev-requirements.txt
python setup.py develop

# to install without building the C++/CUDA extensions
USE_CPP=0 python setup.py install