LoRA: Low‑Rank Adaptation

Parameter‑Efficient Fine‑Tuning for Large Models

San Hashimhama • AI Researcher

Why LoRA Matters

LoRA (Low‑Rank Adaptation) fine‑tunes large neural networks by learning a small low‑rank update to selected weight matrices while keeping the original weights frozen. This dramatically reduces trainable parameters and optimizer state, enabling affordable, fast, and modular adaptation on modest hardware. In practice, LoRA achieves comparable quality to full fine‑tuning for many tasks while using 10–1000× fewer trainable parameters.

Core Idea & Mathematics

Consider a linear layer with weight W ∈ ℝ^(d_out × d_in). Full fine‑tuning learns ΔW of the same size. LoRA constrains the update to have low rank r ≪ min(d_out, d_in):

ΔW ≈ B A, with A ∈ ℝ^(r × d_in), B ∈ ℝ^(d_out × r), rank(ΔW) ≤ r

The adapted weight is then

W' = W + (α/r) · (B A)

where α is a scaling hyperparameter that controls the magnitude of the low‑rank update. In most implementations, W is frozen and only A, B are trained.

Parameter Efficiency

Full update: d_out × d_in parameters. LoRA update: r·(d_in + d_out) parameters (for A and B). For square layers with d = d_in = d_out:

Params(full) = d², Params(LoRA) = 2 d r, Reduction ≈ d² / (2 d r) = d / (2 r)

Example: d = 4096, r = 8 → Params(full) = 16,777,216 vs. Params(LoRA) = 65,536 (≈256× fewer). Optimizer memory also drops proportionally because moments are maintained only for A and B.
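
A quick arithmetic check of these counts in plain Python:

d, r = 4096, 8
full_params = d * d                # 16,777,216
lora_params = r * (d + d)          # A: r·d plus B: d·r = 65,536
print(full_params, lora_params, full_params / lora_params)   # 16777216 65536 256.0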

Forward Pass with LoRA

For input x ∈ ℝ^(d_in):

y = W x + b + (α/r) · (B (A x))

The LoRA branch is a small linear bottleneck (A then B, with no nonlinearity) added residually to the frozen linear layer.
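
A minimal numerical sketch of this forward pass and its merged equivalent, using small illustrative dimensions:

import torch

torch.manual_seed(0)
d_in, d_out, r, alpha = 16, 32, 4, 8
s = alpha / r

W = torch.randn(d_out, d_in)        # frozen base weight
b = torch.randn(d_out)              # frozen bias
A = torch.randn(r, d_in)            # trainable low-rank factor
B = torch.randn(d_out, r)           # trainable low-rank factor
x = torch.randn(d_in)

# LoRA forward: base output plus the scaled bottleneck branch B(A x)
y = W @ x + b + s * (B @ (A @ x))

# Identical to a single linear layer with the merged weight W' = W + (α/r)·BA
y_merged = (W + s * (B @ A)) @ x + b
assert torch.allclose(y, y_merged, atol=1e-5)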

Visual Intuition

Figure: the frozen base weight W, low‑rank factors A (r × d_in) and B (d_out × r), and the adapted weight W' = W + (α/r) · (B A). LoRA learns the low‑rank factors A and B and adds their product to the frozen base weight.
Figure: self‑attention with LoRA. Adapters B_q A_q and B_v A_v sit on the frozen W_q and W_v projections (K optional); typical placement is LoRA on Q and V for the best efficiency/quality trade‑off.

Where to Apply LoRA

In Transformers, LoRA is commonly applied to attention projections where most parameters live and adaptation is impactful:

W_q' = W_q + (α/r) · B_q A_q,   W_v' = W_v + (α/r) · B_v A_v

Adapters are trained per target task/domain; the base model remains unchanged.

Complexity & Memory

Aspect | Full Fine‑Tuning | LoRA
Trainable params per d×d layer | d² | 2 d r
Optimizer states | Moments for all d² weights | Moments only for A, B
Extra FLOPs per forward | None | O(d r) for A x plus O(d r) for B(A x)
Inference overhead | None | Zero after merging W ← W + (α/r) · B A

Since r ≪ d, compute overhead is negligible. During inference, merge the adapter into W to avoid additional matmuls.

Implementation Sketch (PyTorch)

import math
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features, out_features, r=8, alpha=16, dropout=0.0, bias=True):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.r = r
        self.alpha = alpha
        self.scaling = alpha / r if r > 0 else 1.0

        # Frozen base weight
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        nn.init.kaiming_uniform_(self.weight, a=math.sqrt(5))
        self.weight.requires_grad = False

        # Bias is kept frozen as well; only A and B are trained
        self.bias = nn.Parameter(torch.zeros(out_features), requires_grad=False) if bias else None

        # Trainable low-rank factors (A: r×in, B: out×r)
        if r > 0:
            self.A = nn.Parameter(torch.zeros(r, in_features))
            self.B = nn.Parameter(torch.zeros(out_features, r))
            nn.init.kaiming_uniform_(self.A, a=math.sqrt(5))
            nn.init.zeros_(self.B)  # start near zero so W' ≈ W
        else:
            self.register_parameter('A', None)
            self.register_parameter('B', None)

        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Base
        y = x @ self.weight.T
        if self.bias is not None:
            y = y + self.bias
        # Low-rank residual
        if self.r > 0:
            y = y + self.scaling * (self.dropout(x) @ self.A.T @ self.B.T)
        return y

    @torch.no_grad()
    def merge_adapter_(self):
        if self.r > 0:
            self.weight += self.scaling * (self.B @ self.A)
            # After merging, you may set r=0 to disable adapter branch

In practice, LoRA is wrapped around existing linear layers (e.g., attention projections). Many libraries implement injection utilities to replace layers without editing model code.
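
For example, a minimal sketch with the Hugging Face peft library (the checkpoint id and the module names q_proj / v_proj are placeholders that depend on the base architecture):

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("base-model-id")   # placeholder checkpoint id
config = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention Q and V projections (names vary by model)
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()          # only the LoRA A, B factors remain trainable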

Choosing Hyperparameters

Common starting points: a small rank (r = 4–16), α set to r or 2r so the α/r scale stays near 1–2, light adapter dropout (0–0.1), and a learning rate somewhat higher than for full fine‑tuning, since only A and B are updated. Increase r when the target domain is far from the pretraining distribution.

Comparison to Other PEFT Methods

Method | Idea | Trainable Params | Notes
LoRA | Low‑rank residual update B A | Low (∝ r) | Mergeable into W; good quality/efficiency trade‑off
Adapters | Small MLP blocks in residuals | Low–Medium | Add inference latency; not mergeable because of the nonlinearity
Prefix/Prompt Tuning | Learned prompts/key‑values | Very Low | Strong for generation; may underperform on some tasks

Practical Tips & Pitfalls

In my experiments, most of the benefit comes from targeting attention Q and V with small ranks (r ≤ 16). For domain‑heavy shifts (e.g., code or math), adding LoRA to MLP layers often closes the gap to full fine‑tuning.

Training Dynamics & Gradients

LoRA optimizes the loss L(W') over low‑rank factors with W' = W + (α/r)·BA. With W frozen, the gradients factor cleanly:

Let G = ∂L/∂W' ∈ ℝ^(d_out × d_in).

∂L/∂A = (α/r) · B^T G ∈ ℝ^(r × d_in)
∂L/∂B = (α/r) · G A^T ∈ ℝ^(d_out × r)

Zero‑initializing B makes W' ≈ W at step 0, stabilizing early training. The α scale balances update magnitude. Combine with gradient clipping (e.g., 0.5–1.0) for stability on long sequences.
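
These gradient formulas can be checked numerically against autograd; a minimal sketch with random tensors and a toy squared loss:

import torch

torch.manual_seed(0)
d_out, d_in, r, alpha = 8, 6, 2, 4
s = alpha / r

W = torch.randn(d_out, d_in)                      # frozen base weight
A = torch.randn(r, d_in, requires_grad=True)
B = torch.randn(d_out, r, requires_grad=True)
x = torch.randn(5, d_in)

W_prime = W + s * (B @ A)
W_prime.retain_grad()                             # keep G = ∂L/∂W'
loss = (x @ W_prime.T).pow(2).sum()
loss.backward()

G = W_prime.grad
assert torch.allclose(A.grad, s * B.detach().T @ G, atol=1e-5)
assert torch.allclose(B.grad, s * G @ A.detach().T, atol=1e-5)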

Optimization Recipe

Freeze all base parameters, train only A and B with a standard optimizer (e.g., AdamW) and gradient clipping as above, then merge the adapters into W for deployment; the Algorithm Summary at the end gives the loop in pseudocode.

Low‑Rank View via SVD

If a full fine‑tune produces ΔW*, the best rank‑r approximation (Frobenius norm) is given by truncated SVD. This motivates LoRA’s parameterization:

ΔW* = U Σ V^T  ⇒  argmin_{rank(X) ≤ r} ||ΔW* − X||_F = U_r Σ_r V_r^T (the top‑r truncation)

Practical corollary: compress a full fine‑tune by SVD‑factoring ΔW* into B A with rank r, or initialize LoRA from an SVD of a few saved full‑tune checkpoints.
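
A minimal sketch of this compression step, using a random stand‑in for ΔW*:

import torch

d_out, d_in, r = 64, 48, 8
delta_W = torch.randn(d_out, d_in)        # stand-in for a full fine-tune update ΔW*

U, S, Vh = torch.linalg.svd(delta_W, full_matrices=False)
B = U[:, :r] * S[:r]                      # (d_out, r); singular values absorbed into B
A = Vh[:r, :]                             # (r, d_in)

approx = B @ A                            # best rank-r approximation in Frobenius norm
rel_err = torch.linalg.norm(delta_W - approx) / torch.linalg.norm(delta_W)
print(rel_err)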

Figure: truncated SVD of the full update ΔW* (d_out × d_in) as U_r Σ_r V_r^T = B A with rank r. Truncated SVD gives the best rank‑r approximation; LoRA learns such structure directly.

Placement & Shapes in Transformers

Attention projections are often packed as W_qkv ∈ ℝ^(3·d_model × d_model). You can attach one adapter to the packed matrix or split it by projection or head:

Common shapes: W_q, W_k, W_v, W_o ∈ ℝ^(d_model × d_model)
# Inject LoRA into existing Linear layers (e.g., qkv and proj) of an attention block
def inject_lora(module, targets, r=8, alpha=16):
    for name in targets:  # e.g., ["qkv", "proj"]
        base = getattr(module, name)
        lora = LoRALinear(base.in_features, base.out_features,
                          r=r, alpha=alpha, bias=(base.bias is not None))
        with torch.no_grad():
            # Copy the pretrained weights/bias into the frozen base of the LoRA layer
            lora.weight.copy_(base.weight)
            if base.bias is not None:
                lora.bias.copy_(base.bias)
        setattr(module, name, lora)  # swap in the LoRA-wrapped layer

# Example: inject_lora(self.attn, ["qkv", "proj"], r=8, alpha=16)

QLoRA: LoRA on Quantized Models

QLoRA adds adapters on top of 4‑bit quantized base weights to minimize memory without sacrificing much quality. The base W is quantized (e.g., NF4) and dequantized during forward passes; A and B remain trainable at higher precision.

Memory(base) ≈ (bits/8)·|W| (frozen) • Memory(adapters) ≈ params(A,B) + optimizer states for A,B

NF4 (normal‑float 4‑bit) with double quantization for scales and paged optimizers are commonly used. This enables 7B+ models to fine‑tune on a single high‑memory GPU.
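
A minimal configuration sketch with the Hugging Face transformers, bitsandbytes, and peft libraries (the checkpoint id and target module names are placeholders):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # normal-float 4-bit base weights
    bnb_4bit_use_double_quant=True,        # double quantization of the scales
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained("base-model-id", quantization_config=bnb_config)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                         target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)  # A, B stay in higher precision and are trainable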

Figure: relative training memory footprint. Full fine‑tune ≫ LoRA > QLoRA for training memory usage.

Composing Multiple Adapters

Train adapters for different domains or tasks and combine them at inference:

W' = W + Σ_i γ_i · (α_i/r_i) · B_i A_i

Merge pre‑blended adapters into W for zero‑overhead deployment.
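
A minimal sketch of blending and merging several adapters (the adapter list and γ weights are illustrative):

import torch

def blend_adapters(W, adapters, gammas):
    """adapters: list of (A, B, alpha, r) tuples; gammas: per-adapter blend weights γ_i."""
    W_merged = W.clone()
    for (A, B, alpha, r), gamma in zip(adapters, gammas):
        W_merged += gamma * (alpha / r) * (B @ A)
    return W_merged

# Example: blend two task adapters 70/30 into a single deployable weight
# W_deploy = blend_adapters(W, [(A1, B1, 16, 8), (A2, B2, 16, 8)], [0.7, 0.3])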

Algorithm Summary

# Training (LoRA)
freeze(base_model.parameters())
for x, y in data:
    logits = model(x)              # uses W' = W + (α/r)·BA
    loss = loss_fn(logits, y)
    loss.backward()                # grads only for A, B
    clip_grad_norm_(adapters, 1.0)
    opt.step(); opt.zero_grad()

# Deployment
with torch.no_grad():
    for layer in model.lora_layers:
        layer.merge_adapter_()     # W ← W + (α/r)·BA
drop_adapters(model)               # keep W only (zero overhead)