LM #3 | Language Model Fine-tuning

2024-04-13 3. Natural Language Comments

Introduction

Overview (SFT)
- LoRA
  - LoRA
  - QLoRA
  - DoRA
  - LoRA+
- PO (Preference Optimization)
  - RLHF (reinforcement learning from human feedback)
  - RLAIF
  - DPO (direct preference optimization)

SFT : Training language models to follow instructions with human feedback (2022)

Introduction
- Large language models (LLMs) are powerful but often fail to follow human instructions reliably. Simply scaling models doesn’t solve this.
- The paper proposes an alignment-focused training method where a base model is first fine-tuned on human-written responses (SFT), then optimized to match human preferences using RLHF.
Methods
- Supervised Fine-Tuning (SFT): Human labelers write ideal responses to prompts, and the base GPT-3 model is fine-tuned on this data.
- Training Reward Model: Human evaluators are shown multiple model outputs for the same prompt and rank them. A reward model is then trained to predict those rankings (i.e., which output humans prefer).
- Reinforcement Learning (PPO): The SFT model is further tuned using Proximal Policy Optimization (PPO) to maximize the reward model’s score.
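The reward-model step above can be sketched as a pairwise ranking loss: for every pair taken from a human ranking, the loss is small when the preferred output gets the higher score. This is a minimal sketch in plain Python, not the paper's training code; the function names `pairwise_ranking_loss` and `ranking_loss` are hypothetical, and the rewards are passed in as plain numbers rather than produced by a model.

```python
import math
from itertools import combinations

def pairwise_ranking_loss(reward_preferred: float, reward_rejected: float) -> float:
    """Return -log(sigmoid(r_preferred - r_rejected)), computed stably."""
    margin = reward_preferred - reward_rejected
    # -log(sigmoid(margin)) equals softplus(-margin); branch to avoid overflow.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

def ranking_loss(rewards_best_to_worst: list[float]) -> float:
    """Average the pairwise loss over all pairs implied by one human ranking.

    The input holds the reward model's scores for the outputs of one prompt,
    ordered from the output humans ranked best to the one ranked worst.
    """
    # combinations() keeps input order, so the first item of each pair is preferred.
    pairs = list(combinations(rewards_best_to_worst, 2))
    return sum(pairwise_ranking_loss(w, l) for w, l in pairs) / len(pairs)
```

A reward model whose scores agree with the human ranking yields a lower value, e.g. `ranking_loss([3.0, 1.0, -2.0])` is smaller than `ranking_loss([-2.0, 1.0, 3.0])`.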
Conclusion : InstructGPT models trained via this pipeline follow instructions better than GPT-3.

LlamaFactory : Unified Efficient Fine-Tuning of 100+ Language Models

Abstract : LlamaFactory is a unified framework that streamlines efficient fine‑tuning across 100+ LLMs
Introduction
- With the proliferation of open-source LLMs, efficient adaptation is crucial. However, implementing fine‑tuning across varying architectures remains labor-intensive.
- LlamaFactory consolidates diverse efficient fine-tuning algorithms, ranging from LoRA variants to advanced optimizers, into a single framework.
Efficient Fine-Tuning Techniques:
- Efficient Optimization
  - Freeze-tuning (Houlsby et al., 2019) : freezes the majority of parameters while fine-tuning the remaining parameters in a small subset of decoder layers.
  - GaLore (Zhao et al., 2024) : projects gradients into a lower-dimensional space, facilitating full-parameter learning in a memory efficient manner.
  - BAdam (Luo et al., 2024) : leverages block coordinate descent (BCD) to efficiently optimize the extensive parameters.
  - LoRA (Hu et al., 2022) : freezes all pre-trained weights and introduces a pair of trainable low-rank matrices to the designated layer (attention layer)
  - QLoRA (Dettmers et al., 2023) : LoRA combined with quantization, which additionally reduces the memory usage
  - DoRA (Liu et al., 2024) : breaks down pre-trained weights into magnitude and direction components and applies low-rank updates to the directional component for enhanced performance.
  - LoRA+ (Hayou et al., 2024) : addresses the sub-optimality of LoRA by using different learning rates for the two adapter matrices.
  - PiSSA (Meng et al., 2024) : initializes adapters with the principal components of the pre-trained weights for faster convergence.
- Efficient Computation
  - Mixed Precision Training (Micikevicius et al., 2018) : performs most computation in lower-precision floating-point formats to reduce memory use and speed up training.
  - Flash attention (Dao et al., 2022) : drawing insights from the input-output (IO) expenses of the attention layer, introduces a hardware-friendly approach to enhance attention computation.
  - S2 attention (Chen et al., 2024b) : tackles the challenge of extended context with shifted sparse attention, thereby diminishing memory usage in fine-tuning long-context LLMs.
  - Various quantization strategies (Dettmers et al., 2022a; Frantar et al., 2023; Lin et al., 2023; Egiazarian et al., 2024) : decrease memory requirements in large language models (LLMs) by utilizing lower-precision representations for weights
  - Unsloth (Han and Han, 2023) incorporates Triton for implementing the backward propagation of LoRA, which reduces floating-point operations (FLOPs) during gradient descent and leads to expedited LoRA training.
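As an illustration of the freeze-tuning item in the list above, the sketch below picks which parameters stay trainable when only a few decoder layers are fine-tuned. It is a toy selection routine, not LlamaFactory's implementation: the function name `select_trainable`, the `layers.<index>.<rest>` naming scheme, and the choice of the last layers as the trainable subset are all assumptions made for the example.

```python
def select_trainable(param_names: list[str], num_layers: int,
                     num_trainable_layers: int) -> list[str]:
    """Return the parameter names left trainable under freeze-tuning.

    Assumes (hypothetically) that decoder-layer parameters are named
    "layers.<index>.<rest>" and that the trainable subset is the last
    `num_trainable_layers` layers. Everything else is treated as frozen.
    """
    first_trainable = num_layers - num_trainable_layers
    trainable = []
    for name in param_names:
        parts = name.split(".")
        is_layer_param = len(parts) >= 2 and parts[0] == "layers" and parts[1].isdigit()
        if is_layer_param and int(parts[1]) >= first_trainable:
            trainable.append(name)
    return trainable


names = [
    "embed.weight",
    "layers.0.attn.weight",
    "layers.1.attn.weight",
    "layers.2.mlp.weight",
    "layers.3.mlp.weight",
]
# Only the final decoder layer is updated; embeddings and earlier layers stay frozen.
print(select_trainable(names, num_layers=4, num_trainable_layers=1))
```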
Llama Factory Framework
- LlamaFactory consists of three main modules:
  - Model Loader: manipulates various model architectures for fine-tuning, supporting both LLMs and vision language models (VLMs).
  - Data Worker: processes data from different tasks through a well-designed pipeline, supporting both single-turn and multi-turn dialogues.
  - Trainer: applies efficient fine-tuning techniques to different training approaches, supporting pretraining, instruction tuning and preference optimization.
Conclusion : LlamaFactory enables efficient, scalable fine‑tuning across diverse LLMs, validated empirically via strong performance in language modeling and text generation tasks.

Freeze-Tuning : Parameter-Efficient Transfer Learning for NLP (Houlsby et al., 2019)

Abstract:
- The paper proposes adapter modules as a parameter-efficient alternative to full fine-tuning of large pretrained models like BERT.
- Instead of updating all model weights, small bottleneck layers (adapters) are inserted into each layer and only these are trained for new tasks.
- This significantly reduces the number of trainable parameters while maintaining comparable performance to full fine-tuning.
Introduction :
- Transfer learning with large pretrained language models has achieved strong results across NLP tasks, but fine-tuning them fully for each task is resource-intensive and parameter-inefficient, especially when many tasks are involved.
- The authors argue for a more efficient approach where most model parameters remain frozen and they propose using adapters, small additional modules, to enable task-specific adaptation with minimal new parameters.
Methods :
- The authors design lightweight adapter layers that are inserted within each layer of a pretrained transformer.
  - Each adapter consists of a down-projection to a small dimension, a non-linearity, and an up-projection back to the original size, wrapped in a residual (skip) connection.
- During training on a new task, only the adapter parameters are updated; the original model weights remain fixed.
- This approach allows each task to have its own set of adapters while sharing the main backbone, drastically reducing storage and computation costs.
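The adapter structure described above (down-projection, non-linearity, up-projection, residual connection) can be sketched in a few lines. This is a minimal plain-Python illustration rather than the paper's code: the names `matvec` and `adapter` are hypothetical, biases are omitted, and ReLU is used as the non-linearity purely for simplicity.

```python
def matvec(M: list[list[float]], v: list[float]) -> list[float]:
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def adapter(h: list[float], W_down: list[list[float]],
            W_up: list[list[float]]) -> list[float]:
    """Bottleneck adapter: h + W_up * relu(W_down * h).

    W_down has shape (m, d) and W_up has shape (d, m), with m much smaller
    than d. Only these two matrices would be trained; the backbone is frozen.
    """
    z = matvec(W_down, h)                 # project d -> m
    z = [max(0.0, x) for x in z]          # non-linearity (ReLU, an assumption)
    out = matvec(W_up, z)                 # project m -> d
    return [h_i + o_i for h_i, o_i in zip(h, out)]  # residual connection


h = [1.0, -2.0, 3.0, 0.5]
W_down = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
# With W_up set to zero the adapter is an identity map, so inserting it
# does not change the frozen network's output before training starts.
W_up_zero = [[0.0, 0.0] for _ in range(4)]
print(adapter(h, W_down, W_up_zero))
```

Without biases, each adapter adds 2·m·d parameters, which stays small as long as the bottleneck size m is small.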
Conclusion : The experiments show that adapter-based tuning achieves near state-of-the-art performance on various NLP tasks with a fraction of task-specific parameters (around 3% per task).

LoRA: Low-Rank Adaptation of Large Language Models (Hu et al., 2022)

Abstract :
- Instead of updating all model params, LoRA injects trainable low-rank matrices into certain layers (e.g., attention layers), drastically reducing the number of trainable parameters.
Introduction :
- Large pre-trained models have become standard in NLP, but fine-tuning them requires updating billions of parameters, which is resource-intensive and often infeasible for smaller organizations.
- Existing parameter-efficient tuning methods still involve adding many new parameters. LoRA introduces a new solution by decomposing weight updates into low-rank representations, significantly reducing the need for large-scale parameter updates and making adaptation more efficient and scalable.
Methods :
- LoRA freezes the original model weights and injects small trainable matrices into the architecture (typically the attention layers).
- Specifically, it represents the weight update as a product of two smaller matrices (low-rank decomposition), $\Delta W = BA$, whose rank $r$ is much smaller than the dimensions of the weight, effectively reducing the parameter count.
- This method allows for fast adaptation to new tasks while maintaining the original model’s knowledge and minimizing extra memory and computation cost.
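The low-rank update above can be made concrete with a small forward pass and a parameter count. This is a sketch under stated assumptions, not a library API: the names `matvec`, `lora_forward`, and `trainable_params` are hypothetical, the frozen weight is `W0` with shape (d_out, d_in), and the adapters are `A` with shape (r, d_in) and `B` with shape (d_out, r), scaled by `alpha / r`.

```python
def matvec(M: list[list[float]], v: list[float]) -> list[float]:
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lora_forward(x, W0, A, B, alpha: float, r: int) -> list[float]:
    """Compute h = W0 x + (alpha / r) * B (A x); W0 is frozen, A and B are trained."""
    base = matvec(W0, x)
    update = matvec(B, matvec(A, x))      # never materializes the full d_out x d_in update
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

def trainable_params(d_in: int, d_out: int, r: int) -> int:
    """Number of adapter parameters for one weight matrix."""
    return r * (d_in + d_out)


W0 = [[1.0, 2.0], [3.0, 4.0]]
A = [[0.5, -0.5]]                         # r = 1
B_zero = [[0.0], [0.0]]                   # B starts at zero, so the update is zero at first
print(lora_forward([2.0, 1.0], W0, A, B_zero, alpha=2.0, r=1))

# For a 4096 x 4096 weight, rank 8 trains 65,536 values instead of 16,777,216.
print(trainable_params(4096, 4096, 8), 4096 * 4096)
```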
Conclusion:
- LoRA shows that low-rank adaptation can match or even surpass full fine-tuning performance on various tasks while training orders of magnitude fewer parameters.
- This makes LLMs more accessible, easier to personalize, and more practical for real-world deployment.

QLoRA: Efficient Finetuning of Quantized LLMs (Dettmers et al., 2023)

Abstract :
- QLoRA uses 4-bit quantization to reduce memory requirements and integrates LoRA (low-rank adapters) for parameter-efficient fine-tuning.
- This approach allows large models to be trained on single GPUs without compromising performance and achieves comparable or better results than full fine-tuning at a fraction of the cost.
Introduction:
- Existing parameter-efficient methods like LoRA reduce the number of trainable parameters but still need high-precision storage for base model weights, making them memory-intensive.
- QLoRA addresses this by quantizing the base model to 4-bit precision, significantly reducing memory footprint, and using low-rank updates for fine-tuning.
Methods :
1. Applies 4-bit NormalFloat (NF4) quantization to the frozen base model weights, drastically cutting memory use.
2. Attaches small trainable low-rank adapters (as in LoRA) to certain layers.
3. During training, only these adapters are updated, while the quantized base remains fixed.
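To give a feel for step 1, the sketch below quantizes a block of weights to 4-bit codes and reconstructs them. It is a simplified stand-in rather than QLoRA's actual scheme: the 16 levels here are evenly spaced, whereas NF4 uses a different, non-uniform set of levels, and the names `LEVELS`, `quantize_block`, and `dequantize_block` are hypothetical.

```python
# 16 evenly spaced levels in [-1, 1]: a uniform stand-in, NOT the real NF4 values.
LEVELS = [-1.0 + 2.0 * i / 15 for i in range(16)]

def quantize_block(block: list[float]) -> tuple[list[int], float]:
    """Map each weight to the index (0..15) of its nearest level.

    The block is first scaled by its largest absolute value, which is stored
    alongside the 4-bit codes so the weights can be reconstructed later.
    """
    absmax = max(abs(w) for w in block) or 1.0   # avoid dividing by zero
    codes = [
        min(range(16), key=lambda i: abs(w / absmax - LEVELS[i]))
        for w in block
    ]
    return codes, absmax

def dequantize_block(codes: list[int], absmax: float) -> list[float]:
    """Reconstruct approximate weights from 4-bit codes and the block scale."""
    return [LEVELS[c] * absmax for c in codes]


block = [0.5, -1.0, 0.25, 0.0]
codes, scale = quantize_block(block)
restored = dequantize_block(codes, scale)
print(codes, restored)
```

In a QLoRA-style setup the codes stay frozen; they are only dequantized for the forward and backward pass, and gradients update the LoRA adapters alone.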
Conclusion
- QLoRA demonstrates that it is possible to fine-tune LLMs using significantly less memory and compute without sacrificing accuracy.
- The method achieves strong performance on various benchmarks and enables practical fine-tuning of models with tens of billions of parameters on a single GPU.

DoRA : Weight-Decomposed Low-Rank Adaptation (Liu et al., 2024)

Abstract :
- Instead of applying a low-rank update to the weights directly (as in LoRA), DoRA decomposes each pre-trained weight matrix into a magnitude and a direction. The direction is updated with LoRA-style low-rank matrices, and the magnitude is learned as a small separate vector.
Introduction :
- Scaling up LLMs has led to a need for more efficient fine-tuning techniques.
- Existing methods like LoRA inject low-rank matrices into pre-trained weights but can still suffer from overfitting or suboptimal updates.
- DoRA is proposed to address these limitations by rethinking how model weights are adapted: it separates the magnitude of each weight from its direction and applies the low-rank update to the direction.
Methods:
- DoRA decomposes each weight matrix $W$ into a magnitude vector $m = ||W||_c$ (the column-wise norms) and a normalized directional component $W' = W / ||W||_c$, such that $W = m \cdot W'$.
- During adaptation, the direction is updated using a low-rank matrix, while $m$ is initialized from the pre-trained model and trained as a separate vector.
  - Separating the two lets the low-rank update concentrate on direction, while the magnitude is adjusted with very few extra parameters.
- The approach is implemented similarly to LoRA, but with explicit norm-direction separation, and reuses efficient low-rank optimization structures.
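The magnitude/direction split can be checked numerically with a small sketch. It is an illustration under assumptions, not the authors' code: the names `column_norms` and `dora_weight` are hypothetical, norms are taken per column, `delta` stands in for the low-rank product that updates the direction, and the magnitude vector `m` is passed in explicitly.

```python
import math

def column_norms(W: list[list[float]]) -> list[float]:
    """Euclidean norm of each column of W (a list of rows)."""
    rows, cols = len(W), len(W[0])
    return [math.sqrt(sum(W[i][j] ** 2 for i in range(rows))) for j in range(cols)]

def dora_weight(W0, delta, m: list[float]) -> list[list[float]]:
    """Recompose a weight as m_j * V[:, j] / ||V[:, j]||, where V = W0 + delta.

    `delta` plays the role of the low-rank directional update; `m` holds one
    magnitude per column.
    """
    V = [[w + d for w, d in zip(rw, rd)] for rw, rd in zip(W0, delta)]
    norms = column_norms(V)
    return [[m[j] * V[i][j] / norms[j] for j in range(len(m))] for i in range(len(V))]


W0 = [[3.0, 0.0], [4.0, 2.0]]
m = column_norms(W0)                      # magnitudes taken from the pre-trained weight
zero = [[0.0, 0.0], [0.0, 0.0]]
print(dora_weight(W0, zero, m))           # no update: recovers W0

# A directional update changes where each column points, but each column's
# norm stays equal to the corresponding entry of m.
W_new = dora_weight(W0, [[1.0, 1.0], [0.0, 0.0]], m)
print(column_norms(W_new), m)
```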
Conclusion
- DoRA demonstrates that handling magnitude and direction separately leads to better performance than standard low-rank adaptation.
- It achieves higher accuracy across benchmarks and better generalization, with minimal additional parameter cost.

LoRA+: Efficient Low Rank Adaptation of Large Models (Hayou et al., 2024)

Abstract :
- The paper identifies a key limitation of the original LoRA method: using the same learning rate for both adapter matrices A and B leads to inefficient feature learning.
- The authors propose LoRA+, which sets different learning rates (a higher rate for B relative to A) to overcome this issue.
- Empirical results show LoRA+ enhances fine-tuning speed (up to 2× faster) and accuracy (1–2% gain), with no additional computational cost.
Introduction :
- LLMs are central to modern NLP, but full fine-tuning is resource-intensive.
- While effective in reducing costs, standard LoRA uses the same learning rate for both A and B, which becomes suboptimal for wide networks.
Methods :
- Equal learning rates for the two matrices in LoRA lead to one matrix being under-updated, which hinders feature learning. To address this, the authors derive that B should use a higher learning rate than A (η_B ≫ η_A).
- Implements two separate learning rates, η_A and η_B, with a fixed, large ratio (λ = η_B / η_A).
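The two-learning-rate scheme above amounts to giving A and B separate step sizes tied together by the ratio λ. The sketch below shows this with a plain gradient-descent step; it is a simplified illustration, not the authors' implementation. The names `loraplus_learning_rates` and `sgd_step` are hypothetical, the default ratio of 16 is only an illustrative choice, and plain SGD is used for brevity where a real run would typically use an adaptive optimizer.

```python
def loraplus_learning_rates(lr_A: float, ratio: float = 16.0) -> dict[str, float]:
    """Return separate learning rates for A and B, with lr_B = ratio * lr_A.

    The default ratio is an illustrative assumption, not a prescribed value.
    """
    return {"A": lr_A, "B": ratio * lr_A}

def sgd_step(matrix, grad, lr: float):
    """One plain gradient-descent step on a matrix stored as a list of rows."""
    return [[w - lr * g for w, g in zip(rw, rg)] for rw, rg in zip(matrix, grad)]


lrs = loraplus_learning_rates(lr_A=0.01, ratio=16.0)
A, B = [[1.0]], [[1.0]]
grad_A, grad_B = [[1.0]], [[1.0]]

# With identical gradients, B moves 16 times further than A in one step.
A_new = sgd_step(A, grad_A, lrs["A"])
B_new = sgd_step(B, grad_B, lrs["B"])
print(A_new, B_new)
```

Setting `ratio=1.0` recovers standard LoRA, where both matrices share one learning rate.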
Conclusion:
- LoRA+ offers a principled enhancement over LoRA by applying different learning rates to the adapter matrices
- It delivers meaningful benefits in both performance and efficiency without added overhead, making it a practical upgrade for fine-tuning large models.

DPO: Direct Preference Optimization (Rafailov et al., 2023)

Abstract :
- The paper introduces DPO, a simple and effective method for aligning LLMs with human preferences without reinforcement learning.
- Unlike RLHF, DPO directly optimizes the model to prefer responses that humans rate higher, using only preference data and log probabilities; the resulting objective is a binary cross-entropy loss.
Introduction :
- Traditional RLHF pipelines are complex: they require training a separate reward model and performing reinforcement learning (often PPO).
- DPO aims to simplify this process by deriving a closed-form objective that connects the preference data directly to model training.
- The authors show that this objective implicitly performs the same preference alignment as RLHF but with fewer components and hyperparameters.
Method :
- DPO starts with pairwise human preference data: pairs of responses (preferred, dispreferred) to the same prompt.
- For each response, it computes the log-probability ratio between the model being trained and a frozen reference model.
- The loss encourages the model to raise this ratio for preferred responses relative to dispreferred ones, scaled by a parameter $\beta$ that controls how far the model may drift from the reference.
- This directly optimizes model parameters via standard supervised learning, with no separate reward model and no reinforcement learning loop.
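The loss described above can be written for a single preference pair in a few lines. This is a minimal sketch, not a training loop: the function name `dpo_loss` is hypothetical, the four inputs are assumed to be summed token log-probabilities of each response under the trained model and the reference model, and the default `beta=0.1` is only an illustrative value.

```python
import math

def dpo_loss(policy_logp_w: float, policy_logp_l: float,
             ref_logp_w: float, ref_logp_l: float, beta: float = 0.1) -> float:
    """DPO loss for one (preferred w, dispreferred l) pair.

    loss = -log sigmoid(beta * [(log pi(w) - log pi_ref(w))
                                - (log pi(l) - log pi_ref(l))])
    """
    ratio_w = policy_logp_w - ref_logp_w   # log-ratio for the preferred response
    ratio_l = policy_logp_l - ref_logp_l   # log-ratio for the dispreferred response
    logits = beta * (ratio_w - ratio_l)
    # -log(sigmoid(logits)), computed in a numerically stable way.
    if logits > 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


# When the model equals the reference, both log-ratios are zero and the loss is log(2).
print(dpo_loss(-12.0, -15.0, -12.0, -15.0))

# Raising the preferred response's probability relative to the reference lowers the loss.
print(dpo_loss(-10.0, -16.0, -12.0, -15.0))
```

Because the loss only needs log-probabilities from two forward passes, it trains like an ordinary classification objective.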
Conclusion:
- DPO achieves performance comparable to RLHF while being simpler, more stable, and computationally cheaper.
- It eliminates the need for a separate reward model or reinforcement learning loop, making preference-based fine-tuning more practical.