from torchao.quantization.quant_api import quantize, int4_weight_only
m = quantize(m, int4_weight_only())
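For example, here is a minimal sketch (the toy model and sizes are illustrative) showing that a quantized model still composes with torch.compile:

```python
import torch
import torch.nn as nn
from torchao.quantization.quant_api import quantize, int4_weight_only

# toy model; int4 weight-only quantization targets the nn.Linear weights
m = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 1024))
m = m.to(torch.bfloat16).cuda()  # the int4 tinygemm kernel expects bfloat16 on CUDA
m = quantize(m, int4_weight_only())

# the quantized model composes with torch.compile
m = torch.compile(m, mode="max-autotune")
out = m(torch.randn(1, 1024, dtype=torch.bfloat16, device="cuda"))
```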
- 9.5x speedups for image segmentation models with sam-fast compared to vanilla sam.
- 1.16x speedup when composing int8 quantization with 2:4 sparsity, against an already accelerated baseline of bfloat16 dtype and torch.compile(mode="max-autotune") (see the sketch below).
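To illustrate the 2:4 half of that composition, here is a minimal sketch using PyTorch's built-in semi-structured sparsity support; the magnitude-based pruning step is illustrative, not torchao's pruning recipe:

```python
import torch
from torch.sparse import to_sparse_semi_structured

linear = torch.nn.Linear(4096, 4096).half().cuda()

# illustrative 2:4 pruning: keep the 2 largest-magnitude weights in every group of 4
w = linear.weight.detach()
groups = w.view(-1, 4)
keep = torch.zeros_like(groups, dtype=torch.bool)
keep.scatter_(1, groups.abs().topk(2, dim=1).indices, True)

# store the pruned weight in the accelerated 2:4 sparse format
linear.weight = torch.nn.Parameter(to_sparse_semi_structured(w * keep.view_as(w)))
out = linear(torch.randn(128, 4096, dtype=torch.half, device="cuda"))
```

The 1.16x number above comes from additionally applying int8 quantization on top of a model in this sparse format via torchao's APIs.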
from torchao.sparsity.training import SemiSparseLinear, swap_linear_with_semi_sparse_linear

swap_linear_with_semi_sparse_linear(model, {"seq.0": SemiSparseLinear})
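For context, the keys in the config dict are fully qualified module names; here is a minimal sketch (the ToyModel below is illustrative):

```python
import torch
import torch.nn as nn
from torchao.sparsity.training import SemiSparseLinear, swap_linear_with_semi_sparse_linear

class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        # "seq.0" in the config below names the first Linear in this Sequential
        self.seq = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))

    def forward(self, x):
        return self.seq(x)

model = ToyModel().half().cuda()  # runtime 2:4 sparsity targets fp16/bf16 on CUDA
swap_linear_with_semi_sparse_linear(model, {"seq.0": SemiSparseLinear})

# train as usual; the swapped layer sparsifies its weight to 2:4 on the fly
out = model(torch.randn(64, 1024, dtype=torch.half, device="cuda"))
out.sum().backward()
```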
- MX: training and inference support with tensors using the OCP MX spec data types, which can be described as groupwise-scaled float8/float6/float4/int8, with the scales constrained to powers of two. This work is a prototype, as native hardware support is not available yet.
- nf4, which was used to implement QLoRA, one of the most popular finetuning algorithms, without writing custom Triton or CUDA code (see the sketch after this list). Accessible talk here.
- fp6, for 2x faster inference over fp16 with an easy to use API: quantize(model, fp6_llm_weight_only())
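As a minimal sketch of the nf4 tensor (the block sizes shown are common QLoRA-style defaults; treat the exact values as an assumption):

```python
import torch
from torchao.dtypes import to_nf4
from torchao.dtypes.nf4tensor import linear_nf4

# quantize a bf16 weight to nf4 (4-bit normal-float with blockwise scales)
weight = torch.randn(1024, 1024, dtype=torch.bfloat16, device="cuda")
nf4_weight = to_nf4(weight, block_size=64, scaler_block_size=256)

# linear_nf4 dequantizes on the fly, as in QLoRA-style finetuning
x = torch.randn(8, 1024, dtype=torch.bfloat16, device="cuda")
out = linear_nf4(x, nf4_weight)
```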
- Write the dtype, layout, or bit-packing logic in pure PyTorch and code-generate efficient kernels with torch.compile. You can inspect those kernels with TORCH_LOGS="output_code" python your_code.py and check whether a single kernel is being generated and whether any unnecessary buffers are being created (see the first sketch after this list).
- However, once you have a kernel, how do you know how good it is? The best way is to benchmark the compiler-generated code against the best kernel on the market. Packaging custom CPP/CUDA kernels that work on multiple devices is tedious, but we've abstracted all of that tedium away with our custom ops support, so if you love writing kernels but hate packaging, we'd love to accept contributions for your custom ops (see the second sketch after this list). One key benefit is that a kernel written as a custom op will just work with no graph breaks with torch.compile(). Compilers are great at optimizations like fusion and overhead reduction, but it's challenging for a compiler to rewrite the math of an algorithm so that it's both faster and numerically stable, so we are betting on both compilers and custom ops.
- Finally, while historically most quantization has been done for inference, there is now a thriving area of research combining distributed algorithms and quantization. One popular example is NF4, which was used to implement the QLoRA algorithm. The NF4 tensor also contains semantics for how it should be sharded over multiple devices, so it composes with FSDP. We gave an accessible talk on how to do this.
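To make the first bullet above concrete, here is a minimal sketch of bit-packing logic written in pure PyTorch (pack_uint4 is an illustrative helper, not a torchao API):

```python
import torch

def pack_uint4(x: torch.Tensor) -> torch.Tensor:
    # pack pairs of 4-bit values (stored one-per-byte in uint8) into single bytes
    x = x.contiguous().view(-1)
    return (x[::2] << 4) | (x[1::2] & 0xF)

# compile it, then inspect the generated kernel with:
#   TORCH_LOGS="output_code" python your_code.py
packed = torch.compile(pack_uint4)(torch.randint(0, 16, (4096,), dtype=torch.uint8, device="cuda"))
```

And for the second bullet, a sketch of what wrapping a kernel as a custom op can look like with the torch.library API on recent PyTorch (2.4+); the op name and Python body are illustrative stand-ins for a real CPP/CUDA kernel:

```python
import torch

@torch.library.custom_op("mylib::scaled_add", mutates_args=())
def scaled_add(a: torch.Tensor, b: torch.Tensor, scale: float) -> torch.Tensor:
    # a real contribution would dispatch to a hand-written CPP/CUDA kernel here
    return a + scale * b

@scaled_add.register_fake
def _(a: torch.Tensor, b: torch.Tensor, scale: float) -> torch.Tensor:
    # shape/dtype propagation so torch.compile can trace through without graph breaks
    return torch.empty_like(a)
```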
# latest stable PyTorch
pip install torch

# or a PyTorch nightly
pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu121

# stable torchao release from PyPI
pip install torchao

# stable torchao release from the PyTorch index
pip install torchao --extra-index-url https://download.pytorch.org/whl/cu121 # full options are cpu/cu118/cu121/cu124

# torchao nightly release
pip install --pre torchao-nightly --index-url https://download.pytorch.org/whl/nightly/cu121 # full options are cpu/cu118/cu121/cu124

# from source
git clone https://github.com/pytorch/ao
cd ao
python setup.py install
- jeromeku has implemented:
  - GaLore, a drop-in replacement for the Adam optimizer that lets you finetune Llama 7B on a single 4090 card, with up to 70% speedups relative to eager PyTorch
  - DoRA, a newer replacement for QLoRA with more promising convergence characteristics
  - Fused int4/fp16 quant matmul, which is particularly useful for compute-bound workloads, showing 4x speedups over tinygemm for larger batch sizes such as 512
- gau-nernst with fp6 kernels that are 4x faster than fp16 (torchao/prototype/quant_llm)
- vayuda with generic bitpacking kernels that were code-generated using pure PyTorch (prototype/common)
- andreaskoepf and melvinebenezer with 1-bit LLMs: Bitnet 1.58 bitpacked into uint2 and fully code-generated with torch.compile
- Accelerating Neural Network Training with Semi-Structured (2:4) Sparsity
- https://mobiusml.github.io/whisper-static-cache-blog/
- Slaying OOMs at the Mastering LLMs course
- Advanced Quantization at CUDA MODE
- Chip Huyen's GPU Optimization Workshop
- If you have suggestions on the API or use cases you'd like covered, please open an issue.
- If you'd like to co-develop the library with us, please join us on #torchao on discord.gg/cudamode - there are a lot of dtypes out there, and we could use a lot more hands to make them go brrr.
pip install -r dev-requirements.txt
python setup.py develop

# to install without building the C++/CUDA extensions
USE_CPP=0 python setup.py install