LM #6 | Quantization


Overview

  • Quantization : the process of reducing the numerical precision of model weights and/or activations (e.g., FP16 → INT8/INT4) to make LLMs smaller, faster, and cheaper to run.
    • Lower VRAM / RAM usage (e.g., 120B parameters, weights only: FP16 ~240 GB → INT8 ~120 GB → INT4 ~60 GB)
    • Faster inference (less memory bandwidth & cache pressure)
    • Enables larger models on smaller hardware
    • Cheaper to serve
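The memory figures above follow from simple arithmetic: parameters × bits per weight ÷ 8. A minimal sketch, assuming weights only (no quantization scales, KV cache, or activations) and 1 GB = 1e9 bytes; the function name `weight_memory_gb` is illustrative:

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes).

    Ignores per-group scales/zero-points, KV cache, and activations,
    so real deployments need somewhat more than this.
    """
    return num_params * bits_per_weight / 8 / 1e9


# 120B-parameter model at three precisions
footprints = {
    name: weight_memory_gb(120e9, bits)
    for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]
}
for name, gb in footprints.items():
    print(f"{name}: ~{gb:.0f} GB")
```

Each halving of the bit width halves the weight footprint, which is why INT4 can bring a model within reach of hardware that cannot hold the FP16 weights.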
  • Taxonomies :
    • Post-Training Quantization (PTQ) : Quantize a pre-trained model without additional training; at most a small calibration set is used.
      • Pros: Fast, no training needed
      • Cons: Accuracy drop for extreme quantization (INT2/INT3/INT4)
      • Methods:
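        • RTN (Round-To-Nearest) : Rounds each weight to the nearest quantization level with no calibration data. Simple and fast, but the least accurate at low bit widths.
        • GPTQ : Quantizes each layer's weights block by block and adjusts the not-yet-quantized weights to compensate for the rounding error, using approximate second-order information from calibration data; popular for INT4.
        • AWQ (Activation-aware Weight Quantization) : Uses activation statistics to identify the small set of salient weight channels and protects them with per-channel scaling before quantization.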
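As a concrete reference point for the methods above, RTN can be sketched in a few lines. This is a minimal illustration, assuming symmetric quantization with a single scale for the whole list of weights (real implementations typically use per-channel or per-group scales); the names `rtn_quantize` and `dequantize` are illustrative:

```python
def rtn_quantize(weights, bits=4):
    """Symmetric round-to-nearest quantization with one shared scale.

    Maps floats to integers in [-qmax, qmax], where qmax = 2^(bits-1) - 1.
    """
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Python's round() rounds ties to even; clamp to the integer range.
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map integer codes back to approximate float weights."""
    return [v * scale for v in q]


weights = [0.12, -0.53, 0.97, -0.04, 0.31]
q, scale = rtn_quantize(weights, bits=4)
recon = dequantize(q, scale)
max_err = max(abs(w - r) for w, r in zip(weights, recon))
print(q, scale, max_err)
```

The reconstruction error per weight is bounded by half the scale. GPTQ and AWQ can be read as ways of reducing the effect of that error on the layer output rather than on each weight individually.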
    • Quantization-Aware Training (QAT) : Simulate quantization during training or fine-tuning by inserting “fake quantization” nodes into the forward pass, so the model learns weights that tolerate the rounding error.
      • Pros: Best accuracy under very low bits
      • Cons: Slower, needs GPU training, harder to set up
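The "fake quantization" idea can be illustrated on a one-weight toy problem. This is a sketch under stated assumptions, not how any particular framework implements QAT: the forward pass uses the quantize-then-dequantize weight, and the backward pass uses the straight-through estimator (treating the rounding step as identity for gradients), which is a common choice but is not stated in these notes. The scale, learning rate, and function names are illustrative:

```python
def fake_quantize(w, scale, bits=4):
    """Quantize then immediately dequantize: the value stays a float,
    but it can only take values on the quantization grid."""
    qmax = 2 ** (bits - 1) - 1
    q = max(-qmax, min(qmax, round(w / scale)))
    return q * scale


def train(steps=200, lr=0.05, scale=0.1):
    """Fit pred = w_q * x to the target y = 0.5 with squared-error loss."""
    w = 0.0          # latent full-precision weight kept by the optimizer
    x, y = 1.0, 0.5  # single training example
    for _ in range(steps):
        w_q = fake_quantize(w, scale)   # forward pass sees the quantized weight
        pred = w_q * x
        grad_wq = 2 * (pred - y) * x    # d(loss)/d(w_q)
        # Straight-through estimator: assume d(w_q)/d(w) = 1.
        w -= lr * grad_wq
    return w, fake_quantize(w, scale)


w_latent, w_quant = train()
print(w_latent, w_quant)
```

The latent weight keeps moving in full precision until its quantized value reproduces the target, which is the mechanism that lets QAT recover accuracy at low bit widths.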
    • Training-time Quantization (e.g. MXFP4, FP4) : Model is trained directly in low precision
      • Pros: Lowest memory use during training
      • Cons: Requires special kernels / hardware support
      • Methods:
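        • MXFP4 (microscaling 4-bit floating point: FP4 elements that share a per-block scale) → used in GPT-OSS
        • NF4 (4-bit NormalFloat) → QLoRA, where it stores the frozen base weights while the LoRA adapters train in higher precision
        • FP8 training (H100 and other GPUs with native FP8 support; the A100 has none)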
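To make the block-scaled idea behind MXFP4 concrete, here is a simplified sketch. It assumes the layout described in the OCP Microscaling (MX) format: blocks of 32 elements, each element a 4-bit E2M1 float, and one shared power-of-two scale per block. The rounding here is plain nearest-value with saturation, bit packing and special values are omitted, and the function names are illustrative:

```python
import math

# Magnitudes representable by a 4-bit E2M1 float (sign stored separately).
FP4_E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 32
E2M1_MAX_EXP = 2  # largest E2M1 value is 6 = 1.5 * 2^2


def quantize_block(block):
    """Return (shared_exp, elements): elements are E2M1 values, and the
    real value of each is element * 2**shared_exp."""
    max_abs = max(abs(v) for v in block)
    if max_abs == 0:
        return 0, [0.0] * len(block)
    # Power-of-two scale chosen so the largest magnitude lands near the top of the FP4 range.
    shared_exp = math.floor(math.log2(max_abs)) - E2M1_MAX_EXP
    scale = 2.0 ** shared_exp
    elements = []
    for v in block:
        # Nearest representable magnitude; values above 6 saturate to 6.
        mag = min(FP4_E2M1_VALUES, key=lambda c: abs(abs(v) / scale - c))
        elements.append(math.copysign(mag, v))
    return shared_exp, elements


def dequantize_block(shared_exp, elements):
    return [e * 2.0 ** shared_exp for e in elements]


block = [0.1 * i for i in range(BLOCK_SIZE)]
shared_exp, elements = quantize_block(block)
recon = dequantize_block(shared_exp, elements)
# 32 four-bit elements plus one 8-bit shared scale per block.
bits_per_weight = (BLOCK_SIZE * 4 + 8) / BLOCK_SIZE
print(shared_exp, recon[-1], bits_per_weight)
```

Sharing one scale across a small block keeps the overhead to a fraction of a bit per weight while letting each block adapt to its own dynamic range, which is also why dedicated kernels or hardware support are needed to run such formats efficiently.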
  • Related Works
    • NVIDIA/Model-Optimizer (ModelOpt): a library comprising state-of-the-art model optimization techniques including quantization, distillation, pruning, speculative decoding and sparsity to accelerate models.
      • Covers PTQ and QAT as well as pruning, distillation, speculative decoding, …