LM #6 | Quantization

2025-07-04 3. Natural Language Comments

Overview

Quantization : the process of reducing the numerical precision of model weights and/or activations (e.g., FP16 → INT8/INT4) to make LLMs smaller, faster, and cheaper to run.
- Lower VRAM / RAM usage (e.g., 120B weights: FP16 ~240 GB → INT4 ~60 GB)
- Faster inference (less memory bandwidth & cache pressure)
- Enables larger models on smaller hardware
- Cheaper to serve
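The memory figures follow from simple arithmetic: bytes = parameters × bits per weight / 8. A minimal sketch, assuming weights only (no quantization scales, activations, or KV cache) and 1 GB = 10^9 bytes; the helper name `weight_memory_gb` is made up for illustration.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes).

    Counts the weights only; real checkpoints also store per-group
    scales / zero-points, so actual files are somewhat larger.
    """
    return n_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    # 120B parameters at three common precisions
    for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
        print(f"{name}: ~{weight_memory_gb(120e9, bits):.0f} GB")
    # FP16: ~240 GB
    # INT8: ~120 GB
    # INT4: ~60 GB
```

Each halving of the bit width halves the weight footprint, so FP16 → INT4 is roughly a 4x reduction before scale overhead.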
Taxonomy :
- Post Training Quantization (PTQ) : Quantize a pre-trained model without additional training.
  - Pros: Fast, no training needed
  - Cons: Accuracy drop for extreme quantization (INT2/INT3/INT4)
  - Methods:
    - RTN (Round-To-Nearest) : Simple, fast, lower accuracy.
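    - GPTQ : Quantizes one layer at a time, processing weight columns in blocks and updating the not-yet-quantized weights to compensate for the rounding error (using approximate second-order information); popular for INT4.
    - AWQ (Activation-aware Weight Quantization) : Uses activation statistics to find the small fraction of salient weight channels and protects them with per-channel scaling before quantization → typically lower accuracy loss than RTN at INT4.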
- Quantization Aware Training (QAT) : Simulate quantization during training or fine-tuning. The model is trained with “fake quantization nodes” inserted in the forward pass, so the weights learn to tolerate the rounding error.
  - Pros: Best accuracy under very low bits
  - Cons: Slower, needs GPU training, harder to set up
- Training-time Quantization (e.g. MXFP4, FP4) : Model is trained directly in low precision
  - Pros: Lowest memory use during training
  - Cons: Requires special kernels / hardware support
  - Methods:
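    - MXFP4 (Microscaling FP4: blocks of 32 4-bit floating-point values sharing one 8-bit power-of-two scale) → used in GPT-OSS
    - NF4 (4-bit NormalFloat) → QLoRA (frozen base weights stored in NF4; LoRA adapters trained in higher precision)
    - FP8 training (H100 and newer GPUs; A100 has no FP8 Tensor Cores)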
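To make the taxonomy concrete, here is a minimal pure-Python sketch of symmetric round-to-nearest (RTN) quantization with a single scale per tensor, plus the quantize-then-dequantize round trip that a "fake quantization node" performs. The function names are made up for illustration. Real PTQ libraries typically use per-channel or per-group scales, and QAT frameworks additionally pass gradients through the rounding step (the straight-through estimator), which is not shown here.

```python
def rtn_quantize(weights, bits=4):
    """Symmetric RTN: map floats to integer codes in [-qmax, qmax]."""
    qmax = 2 ** (bits - 1) - 1                 # 7 for 4 bits, 127 for 8 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Python's round() rounds exact ties to the nearest even integer.
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate float values."""
    return [c * scale for c in codes]


def fake_quantize(weights, bits=4):
    """Quantize, then immediately dequantize.

    The output is still float, but every value sits on the integer grid,
    so downstream computation sees the quantization error.
    """
    codes, scale = rtn_quantize(weights, bits)
    return dequantize(codes, scale)


if __name__ == "__main__":
    w = [0.7, -0.32, 0.11, 0.02]
    codes, scale = rtn_quantize(w, bits=4)
    print(codes)                               # [7, -3, 1, 0]
    approx = fake_quantize(w, bits=4)
    print([round(a - b, 3) for a, b in zip(w, approx)])  # per-weight error
```

The small weight 0.02 collapses to 0 because one large value sets the scale for the whole tensor. That is the failure mode GPTQ's error compensation and AWQ's channel scaling are designed to reduce.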
Related Work
- NVIDIA/Model-Optimizer (ModelOpt): a library comprising state-of-the-art model optimization techniques including quantization, distillation, pruning, speculative decoding and sparsity to accelerate models.
  - For PTQ, QAT, Pruning, Distillation, Speculative Decoding, …