LM #3 | Language Model Fine-tuning


Introduction

  • Overview (SFT)
    • LoRA
      • LoRA
      • QLoRA
      • DoRA
      • LoRA+
    • PO (Preference Optimization)
      • RLHF (reinforcement learning from human feedback)
      • RLAIF
      • DPO (direct preference optimization)


SFT : Training language models to follow instructions with human feedback (Ouyang et al., 2022)

  • Introduction
    • Large language models (LLMs) are powerful but often fail to follow human instructions reliably. Simply scaling models doesn’t solve this.
    • The paper proposes an alignment-focused training method where a base model is first fine-tuned on human-written responses (SFT), then optimized to match human preferences using RLHF.
  • Methods
    • Supervised Fine-Tuning (SFT): Human labelers write ideal responses to prompts, and the base GPT-3 model is fine-tuned on this data.
    • Training Reward Model: Human evaluators are shown multiple model outputs for the same prompt and rank them. A reward model is then trained to predict these rankings (i.e., which output humans prefer).
    • Reinforcement Learning (PPO): The SFT model is further tuned using Proximal Policy Optimization (PPO) to maximize the reward model’s score.
  • Conclusion : InstructGPT models trained via this pipeline follow instructions better than GPT-3.
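
The reward-model step can be made concrete with the pairwise ranking loss $-\log \sigma(r_w - r_l)$, where $r_w$ and $r_l$ are the scalar rewards assigned to the preferred and the rejected response. The sketch below is a minimal illustration for a single pair. The function name and the example scores are hypothetical and do not come from the paper's code.

```python
import math

def pairwise_ranking_loss(reward_preferred: float, reward_rejected: float) -> float:
    """Return -log(sigmoid(r_w - r_l)) for one comparison pair."""
    margin = reward_preferred - reward_rejected
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow in exp.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Hypothetical reward-model scores for two responses to the same prompt.
loss_correct_order = pairwise_ranking_loss(2.0, 0.0)  # preferred scored higher
loss_wrong_order = pairwise_ranking_loss(0.0, 2.0)    # preferred scored lower
```

The loss is small when the reward model already scores the human-preferred output higher, and grows roughly linearly with the margin when the order is wrong.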


LlamaFactory : Unified Efficient Fine-Tuning of 100+ Language Models

  • Abstract : LlamaFactory is a unified framework that streamlines efficient fine‑tuning across 100+ LLMs
  • Introduction
    • With the proliferation of open-source LLMs, efficient adaptation is crucial. However, implementing fine‑tuning across varying architectures remains labor-intensive.
    • LlamaFactory consolidates diverse efficient fine‑tuning algorithms, ranging from LoRA variants to advanced optimizers, into a single framework.
  • Efficient Fine-Tuning Techniques:
    • Efficient Optimization
      • Freeze-tuning (Houlsby et al., 2019) : involves freezing a majority of params while finetuning the remaining parameters in a small subset of decoder layers
      • GaLore (Zhao et al., 2024) : projects gradients into a lower-dimensional space, facilitating full-parameter learning in a memory efficient manner.
      • BAdam (Luo et al., 2024) : leverages block coordinate descent (BCD) to efficiently optimize the extensive parameters.
      • LoRA (Hu et al., 2022) : freezes all pre-trained weights and introduces a pair of trainable low-rank matrices to the designated layer (attention layer)
      • QLoRA (Dettmers et al., 2023) : LoRA combined with quantization, which additionally reduces the memory usage
      • DoRA (Liu et al., 2024) : breaks down pre-trained weights into magnitude and direction components and updates directional components for enhanced performance
      • LoRA+ (Hayou et al., 2024) : overcomes the sub-optimality of LoRA by using different learning rates for the two adapter matrices.
      • PiSSA (Meng et al., 2024) : initializes adapters with the principal components of the pre-trained weights for faster convergence.
    • Efficient Computation
      • Mixed Precision Training (Micikevicius et al., 2018) : performs most computation in half precision (FP16) while keeping a full-precision (FP32) master copy of the weights, which reduces memory usage and speeds up training.
      • Flash attention (Dao et al., 2022) : drawing insights from the examination of the input-output (IO) expenses of the attention layer, introduces a hardware-friendly approach to enhance attention computation.
      • S2 attention (Chen et al., 2024b) : tackles the challenge of extended context with shifted sparse attention, thereby diminishing memory usage in fine-tuning long-context LLMs
      • Various quantization strategies (Dettmers et al., 2022a; Frantar et al., 2023; Lin et al., 2023; Egiazarian et al., 2024) : decrease memory requirements in large language models (LLMs) by utilizing lower-precision representations for weights
      • Unsloth (Han and Han, 2023) : incorporates Triton for implementing the backward propagation of LoRA, which reduces floating-point operations (FLOPs) during gradient descent and leads to expedited LoRA training.
  • LlamaFactory Framework
    • LlamaFactory consists of three main modules:
      • Model Loader: manipulates various model architectures for fine-tuning, supporting both LLMs and vision language models (VLMs).
      • Data Worker: processes data from different tasks through a well-designed pipeline, supporting both single-turn and multi-turn dialogues.
      • Trainer: applies efficient fine-tuning techniques to different training approaches, supporting pretraining, instruction tuning and preference optimization.
  • Conclusion : LlamaFactory enables efficient, scalable fine‑tuning across diverse LLMs, validated empirically via strong performance in language modeling and text generation tasks.
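
Among the techniques listed above, gradient projection (the idea behind GaLore) is perhaps the least intuitive, so a rough sketch may help. It assumes plain SGD and a projector built from the top singular vectors of the current gradient. The actual method keeps optimizer state (e.g. Adam moments) in the compact space and refreshes the projector only periodically. `low_rank_sgd_step` is a hypothetical helper, not part of LlamaFactory or GaLore.

```python
import numpy as np

def low_rank_sgd_step(weight, grad, rank, lr):
    """One SGD step where the gradient is first projected to a rank-`rank` subspace."""
    # Projector: top-`rank` left singular vectors of the gradient (assumption:
    # recomputed every step here, purely for simplicity).
    u, _, _ = np.linalg.svd(grad, full_matrices=False)
    projector = u[:, :rank]                  # shape (m, rank)
    compact_grad = projector.T @ grad        # shape (rank, n): what optimizer state would track
    update = projector @ compact_grad        # projected back to shape (m, n)
    return weight - lr * update, compact_grad

rng = np.random.default_rng(0)
weight = rng.normal(size=(8, 6))
grad = rng.normal(size=(8, 6))
new_weight, compact_grad = low_rank_sgd_step(weight, grad, rank=2, lr=0.1)
```

All entries of the weight matrix can still change (full-parameter learning), but any per-parameter optimizer state would only need the shape of `compact_grad` rather than the shape of `weight`.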


Freeze-Tuning : Parameter-Efficient Transfer Learning for NLP (Houlsby et al., 2019)

  • Abstract:
    • The paper proposes adapter modules as a parameter-efficient alternative to full fine-tuning of large pretrained models like BERT.
    • Instead of updating all model weights, small bottleneck layers (adapters) are inserted into each layer and only these are trained for new tasks.
    • This significantly reduces the number of trainable parameters while maintaining comparable performance to full fine-tuning.
  • Introduction :
    • Transfer learning with large pretrained language models has achieved strong results across NLP tasks, but fine-tuning them fully for each task is resource-intensive and parameter-inefficient, especially when many tasks are involved.
    • The authors argue for a more efficient approach where most model parameters remain frozen and they propose using adapters, small additional modules, to enable task-specific adaptation with minimal new parameters.
  • Methods :
    • The authors design lightweight adapter layers that are inserted within each layer of a pretrained transformer.
      • Each adapter consists of a down-projection to a small dimension, a non-linearity, and an up-projection back to the original size, with a residual (skip) connection around the whole module.
    • During training on a new task, only the adapter parameters are updated, while the original model weights remain fixed.
    • This approach allows each task to have its own set of adapters while sharing the main backbone, drastically reducing storage and computation costs.
  • Conclusion : The experiments show that adapter-based tuning achieves near state-of-the-art performance on various NLP tasks with a fraction of task-specific parameters (on GLUE, within 0.4% of full fine-tuning while adding only about 3.6% parameters per task).
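
The bottleneck structure described in the Methods can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: biases are omitted, ReLU stands in for the non-linearity, and the dimensions and zero-initialized up-projection are assumptions chosen so that the adapter starts out as an identity mapping.

```python
import numpy as np

def adapter_forward(hidden, w_down, w_up):
    """Bottleneck adapter: down-project, non-linearity, up-project, residual add."""
    bottleneck = np.maximum(hidden @ w_down, 0.0)  # ReLU as a stand-in non-linearity
    return hidden + bottleneck @ w_up              # skip connection around the adapter

d_model, d_bottleneck = 16, 4                      # hypothetical sizes
rng = np.random.default_rng(0)
hidden = rng.normal(size=(2, d_model))             # two token representations
w_down = rng.normal(scale=0.01, size=(d_model, d_bottleneck))
w_up = np.zeros((d_bottleneck, d_model))           # zero init: adapter is the identity at start

out = adapter_forward(hidden, w_down, w_up)
adapter_params = w_down.size + w_up.size           # only these would be trained per task
```

Because of the skip connection, an adapter whose up-projection is (near) zero leaves the pretrained layer's output unchanged, so training starts from the behavior of the original model.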


LoRA: Low-Rank Adaptation of Large Language Models (Hu et al., 2022)

  • Abstract :
    • Instead of updating all model params, LoRA injects trainable low-rank matrices into certain layers (e.g., attention layers), drastically reducing the number of trainable parameters.
  • Introduction :
    • Large pre-trained models have become standard in NLP, but fine-tuning them requires updating billions of parameters, which is resource-intensive and often infeasible for smaller organizations.
    • Existing parameter-efficient tuning methods have drawbacks of their own: adapter layers add inference latency, and prefix-based methods reduce the usable sequence length. LoRA instead decomposes weight updates into low-rank representations, greatly reducing the number of trainable parameters and making adaptation more efficient and scalable.
  • Methods : 
    • LoRA freezes the original model weights and injects small trainable matrices into the architecture (typically the attention layers).
    • Specifically, it approximates the weight update $\Delta W$ as a product of two smaller matrices, $\Delta W = BA$ with rank $r$ much smaller than the dimensions of $W$ (low-rank decomposition), effectively reducing the parameter count.
    • This method allows for fast adaptation to new tasks while maintaining the original model’s knowledge and minimizing extra memory and computation cost.
  • Conclusion:
    • LoRA shows that low-rank adaptation can match or even surpass full fine-tuning performance on various tasks while training orders of magnitude fewer parameters.
    • This makes LLMs more accessible, easier to personalize, and more practical for real-world deployment.
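
A small numerical sketch of the low-rank update may help. It assumes the usual formulation $h = W_0 x + \frac{\alpha}{r} B A x$ with $B$ initialized to zero. The layer size, rank, and $\alpha$ below are illustrative values only.

```python
import numpy as np

d_out, d_in, rank, alpha = 64, 64, 4, 8              # hypothetical sizes
rng = np.random.default_rng(0)

w0 = rng.normal(size=(d_out, d_in))                  # frozen pre-trained weight
lora_a = rng.normal(scale=0.01, size=(rank, d_in))   # trainable, random init
lora_b = np.zeros((d_out, rank))                     # trainable, zero init

def lora_forward(x):
    """Frozen path plus scaled low-rank path; x has shape (batch, d_in)."""
    return x @ w0.T + (alpha / rank) * (x @ lora_a.T) @ lora_b.T

x = rng.normal(size=(3, d_in))
full_params = w0.size                                # parameters updated by full fine-tuning
lora_params = lora_a.size + lora_b.size              # parameters updated by LoRA
```

With $B = 0$ the adapted layer initially reproduces the pre-trained layer exactly. After training, $W_0 + \frac{\alpha}{r} BA$ can be merged into a single matrix, so no extra inference cost is introduced.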


QLoRA: Efficient Finetuning of Quantized LLMs (Dettmers et al., 2023)

  • Abstract :
    • QLoRA uses 4-bit quantization to reduce memory requirements and integrates LoRA (low-rank adapters) for parameter-efficient fine-tuning.
    • This approach allows large models to be trained on single GPUs without compromising performance and achieves comparable or better results than full fine-tuning at a fraction of the cost.
  • Introduction:
    • Existing parameter-efficient methods like LoRA reduce the number of trainable parameters but still need high-precision storage for base model weights, making them memory-intensive.
    • QLoRA addresses this by quantizing the base model to 4-bit precision, significantly reducing memory footprint, and using low-rank updates for fine-tuning.
  • Methods :
    1. Applies 4-bit NormalFloat (NF4) quantization to the frozen base model weights, drastically cutting memory use.
    2. Attaches small trainable low-rank adapters (as in LoRA) to certain layers.
    3. During training, updates only these adapters while the quantized base remains fixed.
  • Conclusion
    • QLoRA demonstrates that it is possible to fine-tune LLMs using significantly less memory and compute without sacrificing accuracy.
    • The method achieves strong performance on various benchmarks and enables practical fine-tuning of models with tens of billions of parameters on a single GPU.
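
The three steps above can be sketched as follows. Note the simplification: the code uses blockwise absmax quantization with uniform integer levels, not the NF4 data type, and it stores one code per byte instead of packing two 4-bit codes together. All names and sizes are hypothetical.

```python
import numpy as np

def quantize_4bit(weights, block_size=64):
    """Blockwise absmax quantization to integer codes in [-7, 7] (representable in 4 bits)."""
    blocks = weights.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 7.0
    scales[scales == 0] = 1.0                      # avoid division by zero for all-zero blocks
    codes = np.clip(np.round(blocks / scales), -7, 7).astype(np.int8)
    return codes, scales

def dequantize_4bit(codes, scales, shape):
    return (codes * scales).reshape(shape)

rng = np.random.default_rng(0)
d_out, d_in, rank = 64, 64, 4
w0 = rng.normal(size=(d_out, d_in))
codes, scales = quantize_4bit(w0)                  # step 1: frozen, low-precision base
lora_a = rng.normal(scale=0.01, size=(rank, d_in)) # step 2: trainable adapters
lora_b = np.zeros((d_out, rank))

def qlora_forward(x):
    w_deq = dequantize_4bit(codes, scales, w0.shape)  # dequantize on the fly for the matmul
    return x @ w_deq.T + (x @ lora_a.T) @ lora_b.T    # step 3: only lora_a / lora_b get gradients

x = rng.normal(size=(2, d_in))
y = qlora_forward(x)
max_error = np.abs(w0 - dequantize_4bit(codes, scales, w0.shape)).max()
```

The base weights are never updated, so their quantization error stays fixed, and the adapters are what learn the task on top of the quantized model.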


DoRA : Weight-Decomposed Low-Rank Adaptation (Liu et al., 2024)

  • Abstract :
    • DoRA decomposes each pre-trained weight matrix into a magnitude component and a direction component and fine-tunes both: the direction is updated with LoRA-style low-rank matrices, while the magnitude is learned directly as a small vector.
  • Introduction :
    • Scaling up LLMs has led to a need for more efficient fine-tuning techniques.
    • Existing methods like LoRA inject low-rank matrices into pre-trained weights but can still suffer from overfitting or suboptimal updates.
    • DoRA is proposed to address these limitations by rethinking how model weights are adapted: magnitude and direction are adapted separately, with low-rank updates applied only to the direction.
  • Methods:
    • DoRA decomposes each weight matrix $W$ into a magnitude vector $m$ (the column-wise norms $||W||_c$) and a normalized directional component $W' = W / ||W||_c$, such that $W = m \cdot W'$.
    • During adaptation, the direction is updated using a low-rank matrix as in LoRA, while $m$ is initialized from the pre-trained norms and trained directly.
      • Separating the two lets the scale and the direction of each weight column change independently, which the paper argues is closer to the learning behavior of full fine-tuning.
    • The approach is implemented similarly to LoRA, but with explicit norm-direction separation, and reuses efficient low-rank optimization structures.
  • Conclusion
    • DoRA demonstrates that separating magnitude and directional updates leads to better performance than standard low-rank adaptation.
    • It achieves higher accuracy across benchmarks and better generalization, with minimal additional parameter cost.
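
A minimal sketch of the weight decomposition as formulated in the DoRA paper, $W' = m \cdot \frac{W_0 + BA}{||W_0 + BA||_c}$, where $||\cdot||_c$ is the column-wise norm and both the magnitude vector $m$ and the low-rank factors $B$, $A$ are trainable. Sizes and initial values are illustrative assumptions.

```python
import numpy as np

d_out, d_in, rank = 8, 6, 2                            # hypothetical sizes
rng = np.random.default_rng(0)

w0 = rng.normal(size=(d_out, d_in))                    # frozen pre-trained weight
magnitude = np.linalg.norm(w0, axis=0, keepdims=True)  # trainable, init = column norms of w0
lora_a = rng.normal(scale=0.01, size=(rank, d_in))     # trainable
lora_b = np.zeros((d_out, rank))                       # trainable, zero init

def dora_weight(magnitude, lora_b, lora_a):
    """Recompose the adapted weight from magnitude and low-rank-updated direction."""
    direction = w0 + lora_b @ lora_a                   # directional update via LoRA factors
    col_norm = np.linalg.norm(direction, axis=0, keepdims=True)
    return magnitude * direction / col_norm            # unit-norm columns rescaled by magnitude

w_init = dora_weight(magnitude, lora_b, lora_a)
```

At initialization the recomposed weight equals the pre-trained weight. During training, changes to `lora_a` and `lora_b` only rotate the columns, while changes to `magnitude` only rescale them.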


LoRA+: Efficient Low Rank Adaptation of Large Models (Hayou et al., 2024)

  • Abstract :
    • The paper identifies a key limitation of the original LoRA method: using the same learning rate for both adapter matrices A and B leads to inefficient feature learning.
    • The authors propose LoRA+, which sets different learning rates (a higher rate for B relative to A) to overcome this issue.
    • Empirical results show LoRA+ enhances fine-tuning speed (up to 2× faster) and accuracy (1–2% gain), with no additional computational cost.
  • Introduction :
    • LLMs are central to modern NLP, but full fine-tuning is resource-intensive.
    • While effective in reducing costs, standard LoRA uses the same learning rate for both A and B, which becomes suboptimal for wide networks.
  • Methods :
    • With equal learning rates for the two matrices, one of them is under-updated, which hinders feature learning. To address this, the authors derive that B should use a higher learning rate than A ($\eta_B \gg \eta_A$).
    • LoRA+ implements two separate learning rates, $\eta_A$ and $\eta_B$, with a fixed, large ratio $\lambda = \eta_B / \eta_A$.
  • Conclusion:
    • LoRA+ offers a principled enhancement over LoRA by applying different learning rates to the adapter matrices
    • It delivers meaningful benefits in both performance and efficiency without added overhead, making it a practical upgrade for fine-tuning large models.
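
In practice the change amounts to giving the two adapter matrices separate optimizer settings. The sketch below mimics optimizer parameter groups with plain dictionaries and a hand-written SGD step. The helper names, the base learning rate, and the ratio of 16 are illustrative assumptions rather than values prescribed by the notes above.

```python
import numpy as np

def loraplus_param_groups(lora_a, lora_b, lr_a, lr_ratio):
    """Parameter groups in which B uses lr_ratio times the learning rate of A."""
    return [
        {"name": "lora_a", "param": lora_a, "lr": lr_a},
        {"name": "lora_b", "param": lora_b, "lr": lr_a * lr_ratio},
    ]

def sgd_step(groups, grads):
    """Plain SGD update applied in place, one learning rate per group."""
    for group in groups:
        group["param"] -= group["lr"] * grads[group["name"]]

lora_a = np.zeros((2, 4))
lora_b = np.zeros((4, 2))
groups = loraplus_param_groups(lora_a, lora_b, lr_a=1e-3, lr_ratio=16.0)
grads = {"lora_a": np.ones_like(lora_a), "lora_b": np.ones_like(lora_b)}  # dummy gradients
sgd_step(groups, grads)
```

With identical gradients, B moves 16 times further than A in a single step. Nothing else about the LoRA forward or backward pass changes, which is why the method adds no compute cost.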


DPO: Direct Preference Optimization (Rafailov et al., 2023)

  • Abstract :
    • The paper introduces DPO, a simple and effective method for aligning LLMs with human preferences without reinforcement learning.
    • Unlike RLHF, DPO directly optimizes the model to prefer responses that humans rate higher, using only preference data and log probabilities in a binary cross-entropy loss.
  • Introduction :
    • Traditional RLHF pipelines are complex: they require training a separate reward model and performing reinforcement learning (often PPO).
    • DPO aims to simplify this process by deriving a closed-form objective that connects the preference data directly to model training.
    • The authors show that this objective implicitly performs the same preference alignment as RLHF but with fewer components and hyperparameters.
  • Method :
    • DPO starts with pairwise human preference data: pairs of responses (preferred, dispreferred) to the same prompt.
    • For each response, it computes the log-probability ratio between the model being trained and a frozen reference model.
    • The loss encourages the model to increase the probability of preferred responses relative to dispreferred ones, with the parameter $\beta$ controlling how strongly the model is kept close to the reference model.
    • This directly optimizes model parameters with a standard supervised, classification-style loss: no reward model and no policy optimization step.
  • Conclusion:
    • DPO achieves performance comparable to RLHF while being simpler, more stable, and computationally cheaper.
    • It eliminates the need for a separate reward model or reinforcement learning loop, making preference-based fine-tuning more practical.
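
For a single preference pair, the objective can be written as $-\log \sigma\big(\beta[(\log \pi_\theta(y_w|x) - \log \pi_{ref}(y_w|x)) - (\log \pi_\theta(y_l|x) - \log \pi_{ref}(y_l|x))]\big)$. The sketch below evaluates it from four sequence log-probabilities. The function name, the example numbers, and the default $\beta = 0.1$ are illustrative assumptions.

```python
import math

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (preferred, dispreferred) pair of sequence log-probabilities."""
    # Log-ratio of policy vs. reference for each response, then their difference.
    logits = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    # -log(sigmoid(z)) == log(1 + exp(-z)); branch to avoid overflow in exp.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))

# Hypothetical log-probabilities: the policy starts identical to the reference...
loss_at_start = dpo_loss(-5.0, -7.0, -5.0, -7.0)
# ...and then raises the log-probability of the preferred response.
loss_after_update = dpo_loss(-4.0, -7.0, -5.0, -7.0)
```

When the policy equals the reference model, the loss is $\log 2$ regardless of the pair. It decreases as the policy shifts probability toward the preferred response relative to the reference.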