Parameter‑Efficient Fine‑Tuning for Large Models
LoRA (Low‑Rank Adaptation) fine‑tunes large neural networks by learning a small low‑rank update to selected weight matrices while keeping the original weights frozen. This dramatically reduces trainable parameters and optimizer state, enabling affordable, fast, and modular adaptation on modest hardware. In practice, LoRA achieves comparable quality to full fine‑tuning for many tasks while using 10–1000× fewer trainable parameters.
Consider a linear layer with weight W ∈ ℝ^(dout×din). Full fine‑tuning learns an update ΔW of the same size. LoRA constrains the update to have low rank r ≪ min(dout, din):

ΔW = BA, where B ∈ ℝ^(dout×r) and A ∈ ℝ^(r×din).

The adapted weight is then

W' = W + (α/r)·BA,

where α is a scaling hyperparameter that controls the magnitude of the low‑rank update. In most implementations, W is frozen and only A and B are trained.
Full update: dout·din parameters. LoRA update: r·(din + dout) parameters (for A and B). For square layers with d = din = dout:

Params(full) = d², Params(LoRA) = 2·d·r, a reduction factor of d²/(2·d·r) = d/(2r).
Example: d = 4096, r = 8 → Params(full) = 16,777,216 vs. Params(LoRA) = 65,536 (≈256× fewer). Optimizer memory also drops proportionally because moments are maintained only for A and B.
For input x ∈ ℝ^(din), the layer computes

y = Wx + (α/r)·B(Ax).
The LoRA branch is a small bottleneck MLP (A then B) added residually to the frozen linear layer.
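A minimal numeric check of this residual form and of the merged form W + (α/r)·BA; the shapes and the α/r value are illustrative only:

import torch

dout, din, r, alpha = 6, 4, 2, 16
scaling = alpha / r
W = torch.randn(dout, din)                 # frozen base weight
A = torch.randn(r, din)                    # down-projection (trainable in LoRA)
B = torch.randn(dout, r)                   # up-projection (trainable in LoRA)
x = torch.randn(din)

y_branch = W @ x + scaling * (B @ (A @ x)) # adapter applied as a residual branch
y_merged = (W + scaling * (B @ A)) @ x     # adapter folded into the weight
assert torch.allclose(y_branch, y_merged, atol=1e-5)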
In Transformers, LoRA is commonly applied to the attention projections (the query, key, value, and output matrices Wq, Wk, Wv, Wo), where a large share of the parameters live and adaptation is impactful.
Adapters are trained per target task/domain; the base model remains unchanged.
| Aspect | Full Fine‑Tuning | LoRA |
|---|---|---|
| Trainable params per d×d layer | d² | 2 d r |
| Optimizer states | Moments for all d² weights | Moments only for A and B |
| Extra FLOPs per forward | — | O(d·r) for Ax plus O(d·r) for B(Ax) |
| Inference overhead | — | Zero after merge: W ← W + (α/r) · BA |
Since r ≪ d, compute overhead is negligible. During inference, merge the adapter into W to avoid additional matmuls.
import math
import torch
import torch.nn as nn
class LoRALinear(nn.Module):
def __init__(self, in_features, out_features, r=8, alpha=16, dropout=0.0, bias=True):
super().__init__()
self.in_features = in_features
self.out_features = out_features
self.r = r
self.alpha = alpha
self.scaling = alpha / r if r > 0 else 1.0
# Frozen base weight
self.weight = nn.Parameter(torch.empty(out_features, in_features))
nn.init.kaiming_uniform_(self.weight, a=math.sqrt(5))
self.weight.requires_grad = False
self.bias = nn.Parameter(torch.zeros(out_features)) if bias else None
# Trainable low-rank factors (A: r×in, B: out×r)
if r > 0:
self.A = nn.Parameter(torch.zeros(r, in_features))
self.B = nn.Parameter(torch.zeros(out_features, r))
nn.init.kaiming_uniform_(self.A, a=math.sqrt(5))
nn.init.zeros_(self.B) # start near zero so W' ≈ W
else:
self.register_parameter('A', None)
self.register_parameter('B', None)
self.dropout = nn.Dropout(dropout)
def forward(self, x):
# Base
y = x @ self.weight.T
if self.bias is not None:
y = y + self.bias
# Low-rank residual
if self.r > 0:
y = y + self.scaling * (self.dropout(x) @ self.A.T @ self.B.T)
return y
@torch.no_grad()
def merge_adapter_(self):
if self.r > 0:
self.weight += self.scaling * (self.B @ self.A)
# After merging, you may set r=0 to disable adapter branch
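A quick usage check of the class above; the layer size and rank are illustrative, and B is filled with random values to stand in for a trained adapter (it is zero at initialization):

layer = LoRALinear(4096, 4096, r=8, alpha=16, bias=False)
with torch.no_grad():
    layer.B.normal_()                      # pretend the adapter has been trained

x = torch.randn(2, 4096)
y_branch = layer(x)                        # adapter applied as a residual branch

trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)                           # 65,536 = 2·d·r, matching the count above

layer.merge_adapter_()                     # W ← W + (α/r)·BA
layer.r = 0                                # disable the branch; W alone now carries the update
y_merged = layer(x)
assert torch.allclose(y_branch, y_merged, atol=1e-3)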
In practice, LoRA is wrapped around existing linear layers (e.g., attention projections). Many libraries implement injection utilities to replace layers without editing model code.
| Method | Idea | Trainable Params | Notes |
|---|---|---|---|
| LoRA | Low‑rank residual update B A | Low (∝ r) | Mergeable; good quality/efficiency trade‑off |
| Adapters | Small MLP blocks in residuals | Low–Medium | Heavier at inference unless merged |
| Prefix/Prompt Tuning | Learned prompts/key‑values | Very Low | Strong for generation; may underperform on some tasks |
LoRA optimizes the loss L(W') over the low‑rank factors, with W' = W + (α/r)·BA. With W frozen, the gradients factor cleanly:

∂L/∂B = (α/r)·(∂L/∂W')·Aᵀ,  ∂L/∂A = (α/r)·Bᵀ·(∂L/∂W').
Zero‑initializing B makes W' ≈ W at step 0, stabilizing early training. The α scale balances update magnitude. Combine with gradient clipping (e.g., 0.5–1.0) for stability on long sequences.
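The factorized gradients above can be sanity‑checked against autograd; the quadratic loss in this sketch is arbitrary and used only for illustration:

import torch

dout, din, r, alpha = 5, 7, 3, 6
scaling = alpha / r
W = torch.randn(dout, din)                 # frozen
A = torch.randn(r, din, requires_grad=True)
B = torch.randn(dout, r, requires_grad=True)

W_prime = W + scaling * (B @ A)
loss = (W_prime ** 2).sum()                # arbitrary scalar loss L(W')
loss.backward()

dL_dWp = 2 * W_prime.detach()              # ∂L/∂W' for this particular loss
assert torch.allclose(B.grad, scaling * (dL_dWp @ A.detach().T), rtol=1e-4, atol=1e-4)
assert torch.allclose(A.grad, scaling * (B.detach().T @ dL_dWp), rtol=1e-4, atol=1e-4)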
If a full fine‑tune produces ΔW*, the best rank‑r approximation in Frobenius norm is the truncated SVD (Eckart–Young). This motivates LoRA's parameterization:

ΔW* ≈ UᵣΣᵣVᵣᵀ, so one can take B = UᵣΣᵣ and A = Vᵣᵀ (or split Σᵣ between the two factors).
Practical corollary: compress a full fine‑tune by SVD‑factoring ΔW* into B A with rank r, or initialize LoRA from an SVD of a few saved full‑tune checkpoints.
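A sketch of that corollary; the random matrix below is only a stand‑in for a real ΔW* saved from a full fine‑tune:

import torch

dout, din, r = 512, 512, 8
delta_W = torch.randn(dout, din)           # stand-in for a real full fine-tune update ΔW*

# Truncated SVD gives the best rank-r approximation in Frobenius norm
U, S, Vh = torch.linalg.svd(delta_W, full_matrices=False)
B = U[:, :r] * S[:r]                       # absorb singular values into B: dout × r
A = Vh[:r, :]                              # r × din

rel_err = torch.linalg.norm(delta_W - B @ A) / torch.linalg.norm(delta_W)
print(f"rank-{r} relative Frobenius error: {rel_err:.3f}")
# A real ΔW* typically has more low-rank structure than this random stand-in, so its error is smaller.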
Attention projections are often packed as a single matrix Wqkv ∈ ℝ^(3·dmodel×dmodel). You can attach one adapter to the packed matrix, or split the adapter by projection or by head:
# Inject LoRA into selected Linear layers (e.g., qkv, proj) of an attention block
def inject_lora(module, targets, r=8, alpha=16):
    for name in targets:                            # e.g., ["qkv", "proj"]
        base = getattr(module, name)
        lora = LoRALinear(base.in_features, base.out_features,
                          r=r, alpha=alpha, bias=(base.bias is not None))
        with torch.no_grad():
            lora.weight.copy_(base.weight)          # carry over the frozen base weight
            if base.bias is not None:
                lora.bias.copy_(base.bias)
        setattr(module, name, lora)                 # swap in the adapter-wrapped layer

# Example: inject_lora(self.attn, ["qkv", "proj"], r=8, alpha=16)
QLoRA adds adapters on top of 4‑bit quantized base weights to minimize memory without sacrificing much quality. The base W is quantized (e.g., NF4) and dequantized during forward passes; A and B remain trainable at higher precision.
NF4 (normal‑float 4‑bit) quantization, double quantization of the quantization scales, and paged optimizers are commonly used together. This makes it feasible to fine‑tune 7B+‑parameter models on a single high‑memory GPU.
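A minimal sketch of this setup with the Hugging Face transformers, peft, and bitsandbytes stack; the model id, rank, and target module names are illustrative assumptions, not prescriptions:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization of the frozen base weights, with double quantization of the scales
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",            # illustrative model id
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# Higher-precision LoRA adapters on top of the quantized base
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()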
Train adapters for different domains or tasks and combine them at inference, e.g., as a weighted sum of their updates:

W' = W + Σᵢ wᵢ·(α/r)·BᵢAᵢ, where the wᵢ are blend weights.
Merge pre‑blended adapters into W for zero‑overhead deployment.
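A sketch of blending, assuming two adapters trained separately for the same LoRALinear layer with the same α/r; the blend weights are a modeling choice:

import torch

@torch.no_grad()
def merge_blended_(layer, adapters, weights):
    # layer:    a LoRALinear (uses its frozen .weight and .scaling)
    # adapters: list of (A, B) pairs with A: r×din, B: dout×r
    # weights:  per-adapter blend coefficients, e.g. [0.7, 0.3]
    for (A, B), w in zip(adapters, weights):
        layer.weight += w * layer.scaling * (B @ A)

# Hypothetical usage with a "medical" and a "legal" adapter:
# merge_blended_(layer, [(A_med, B_med), (A_law, B_law)], weights=[0.7, 0.3])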
Putting it together, a minimal training and deployment loop (assuming the model's target linear layers have already been replaced by LoRALinear):

# Training (LoRA): freeze the base model, train only the A and B factors
adapter_params = [p for m in model.modules() if isinstance(m, LoRALinear)
                  for p in (m.A, m.B) if p is not None]
for p in model.parameters():
    p.requires_grad_(False)
for p in adapter_params:
    p.requires_grad_(True)
opt = torch.optim.AdamW(adapter_params, lr=1e-4)   # illustrative hyperparameters

for x, y in data:
    logits = model(x)                              # forward uses W' = W + (α/r)·BA
    loss = loss_fn(logits, y)
    loss.backward()                                # gradients accumulate only for A, B
    torch.nn.utils.clip_grad_norm_(adapter_params, 1.0)
    opt.step()
    opt.zero_grad()

# Deployment: fold each adapter into its base weight, then drop the branch
with torch.no_grad():
    for m in model.modules():
        if isinstance(m, LoRALinear):
            m.merge_adapter_()                     # W ← W + (α/r)·BA
            m.r = 0                                # forward now skips the adapter branch
            m.A = None                             # drop the factors; keep W only (zero overhead)
            m.B = None