Overview
- Quantization : The process of reducing the numerical precision of model weights and/or activations (e.g., FP16 → INT8/INT4) to make LLMs smaller, faster, and cheaper to run.
- Lower VRAM / RAM usage (e.g., 120B parameters: FP16 ~240 GB → INT4 ~60 GB, weights only)
- Faster inference (less memory bandwidth & cache pressure)
- Enables larger models on smaller hardware
- Cheaper to serve
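The memory figures above follow from simple arithmetic: parameter count times bits per weight, divided by 8 to get bytes. A minimal sketch, assuming weights only (no quantization scales, KV cache, or activations) and 1 GB = 10^9 bytes; `weight_memory_gb` is an illustrative helper name, not part of any library:

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes).

    Ignores quantization metadata (scales, zero-points), KV cache and activations.
    """
    return num_params * bits_per_weight / 8 / 1e9


# A 120B-parameter model at three precisions.
for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_memory_gb(120e9, bits):.0f} GB")
# FP16: 240 GB
# INT8: 120 GB
# INT4: 60 GB
```

Real INT4 checkpoints are somewhat larger than this estimate because each group of weights also stores a scale (and sometimes a zero-point).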
- Taxonomies :
- Post Training Quantization (PTQ) : Quantize a pre-trained model without additional training.
- Pros: Fast, no training needed
- Cons: Accuracy can drop noticeably at very low bit widths (INT4 and below, e.g., INT3/INT2)
- Methods:
- RTN (Round-To-Nearest) : Simple, fast, lower accuracy.
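- GPTQ : Layer-wise quantization that processes weight columns in blocks and uses approximate second-order (Hessian) information to compensate for rounding error; popular for INT4.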
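- AWQ (Activation-aware Weight Quantization) : Uses activation statistics to identify salient weight channels and rescales them before quantization so they lose less precision; weight-only, commonly used at INT4.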
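To make the simplest PTQ method concrete, here is a minimal sketch of symmetric round-to-nearest (RTN) quantization for a single row of weights. It assumes one scale per row and a signed INT4 range of [-8, 7]; the function names `rtn_quantize` and `dequantize` are illustrative, and real toolkits typically use per-group scales and packed storage. GPTQ and AWQ build on the same quantize/dequantize step but choose the rounding or the scaling more carefully.

```python
def rtn_quantize(weights, bits=4):
    """Symmetric round-to-nearest quantization of one weight row.

    Returns (integer codes, scale). Assumes a single scale for the whole row.
    """
    qmax = 2 ** (bits - 1) - 1                      # 7 for INT4
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # avoid division by zero
    # Python's round() rounds exact ties to the nearest even integer.
    codes = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


row = [0.7, -0.3, 0.12, 0.0]
codes, scale = rtn_quantize(row)
restored = dequantize(codes, scale)
max_error = max(abs(a - b) for a, b in zip(row, restored))
print(codes, scale, max_error)
```

For values inside the clipping range, the per-weight error is at most half a quantization step (`scale / 2`), which is why a single large outlier in a row hurts: it inflates the scale for every other weight.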
- Quantization Aware Training (QAT) : Simulate quantization during training or fine-tuning by inserting "fake quantization" nodes into the forward pass, so the model learns weights that tolerate the rounding error.
- Pros: Best accuracy under very low bits
- Cons: Slower, needs GPU training, harder to set up
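The "fake quantization" idea can be illustrated with a toy example: the forward pass uses a quantize-then-dequantize copy of the weight, while the update is applied to a full-precision latent weight. The sketch below assumes a straight-through estimator (the rounding step is treated as identity for gradients inside the clipping range), a fixed scale of 0.1, and a one-weight model; `fake_quant` and `ste_grad` are hypothetical helper names, not a framework API.

```python
def fake_quant(w, scale, bits=4):
    """Quantize then immediately dequantize: the value stays a float,
    but it can only take values on the integer grid times `scale`."""
    qmax = 2 ** (bits - 1) - 1
    q = max(-qmax - 1, min(qmax, round(w / scale)))
    return q * scale


def ste_grad(w, scale, upstream, bits=4):
    """Straight-through estimator: pass the gradient through unchanged
    inside the clipping range, and zero it outside."""
    qmax = 2 ** (bits - 1) - 1
    inside = (-qmax - 1) * scale <= w <= qmax * scale
    return upstream if inside else 0.0


# Toy QAT loop: fit y = 0.5 * x with a single latent full-precision weight.
w, scale, lr = 0.0, 0.1, 0.05
data = [(1.0, 0.5), (2.0, 1.0), (-1.0, -0.5)]
for _ in range(50):
    for x, y in data:
        w_q = fake_quant(w, scale)        # forward pass sees the quantized weight
        err = w_q * x - y                 # squared-error loss: err ** 2
        grad_wq = 2 * err * x             # d(loss) / d(w_q)
        w -= lr * ste_grad(w, scale, grad_wq)  # update the latent weight

print("latent weight:", w, "quantized weight:", fake_quant(w, scale))
```

The latent weight does not need to land exactly on a grid point; training stops changing it once its quantized value gives zero error, which is the behaviour QAT relies on.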
- Training-time Quantization (e.g. MXFP4, FP4) : Model is trained directly in low precision
- Pros: Lowest memory use during training
- Cons: Requires special kernels / hardware support
- Methods:
- MXFP4 (Microscaling FP4: 4-bit floating-point elements that share one power-of-two scale per block) → used for the MoE weights in GPT-OSS
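- NF4 (4-bit NormalFloat) → QLoRA, where the frozen base weights are stored in NF4 while the LoRA adapters are trained in higher precision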
- FP8 training (H100 and newer GPUs; the A100 has no FP8 Tensor Cores)
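As a rough illustration of what a microscaling format stores, the sketch below quantizes one block to an MXFP4-style layout: a shared power-of-two exponent plus one 4-bit E2M1 value per element. It assumes a block size of 32, the E2M1 magnitudes {0, 0.5, 1, 1.5, 2, 3, 4, 6}, and a shared exponent chosen from the block's largest magnitude, which is my reading of the OCP Microscaling format; tie-breaking is simplified, and `mx_quantize_block` is an illustrative name rather than a library function.

```python
import math

# Magnitudes representable by a 4-bit E2M1 float (sign is a separate bit).
FP4_E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E2M1_MAX_EXPONENT = 2   # 4.0 = 2**2 is the largest power of two in the grid
BLOCK_SIZE = 32         # assumed number of elements sharing one scale


def mx_quantize_block(block):
    """Return (shared_exponent, elements) with each element on the E2M1 grid.

    Simplified: ties go to the smaller magnitude, and values that exceed the
    largest grid point after scaling are clamped to it.
    """
    max_abs = max(abs(v) for v in block)
    if max_abs == 0:
        return 0, [0.0] * len(block)
    shared_exp = math.floor(math.log2(max_abs)) - E2M1_MAX_EXPONENT
    scale = 2.0 ** shared_exp
    elements = []
    for v in block:
        target = min(abs(v) / scale, FP4_E2M1_MAGNITUDES[-1])
        mag = min(FP4_E2M1_MAGNITUDES, key=lambda m: abs(m - target))
        elements.append(math.copysign(mag, v))
    return shared_exp, elements


def mx_dequantize_block(shared_exp, elements):
    """Reconstruct approximate values from the shared exponent and elements."""
    return [e * 2.0 ** shared_exp for e in elements]


block = [1.5, 0.75, -0.25, 0.1] + [0.0] * (BLOCK_SIZE - 4)
shared_exp, elements = mx_quantize_block(block)
print(shared_exp, elements[:4], mx_dequantize_block(shared_exp, elements)[:4])
```

Because the scale is a power of two shared by the whole block, applying it is an exponent adjustment rather than a full multiply, which is part of why such formats need dedicated kernel or hardware support.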
- Related Works
- NVIDIA/Model-Optimizer (ModelOpt): a library comprising state-of-the-art model optimization techniques including quantization, distillation, pruning, speculative decoding and sparsity to accelerate models.
- For PTQ, QAT, Pruning, Distillation, Speculative Decoding, …