Parameter‑Efficient Fine‑Tuning for Large Models
LoRA (Low‑Rank Adaptation) fine‑tunes large neural networks by learning a small low‑rank update to selected weight matrices while keeping the original weights frozen. This dramatically reduces trainable parameters and optimizer state, enabling affordable, fast, and modular adaptation on modest hardware. In practice, LoRA achieves comparable quality to full fine‑tuning for many tasks while using 10–1000× fewer trainable parameters.
Consider a linear layer with weight W ∈ ℝ^(dout×din). Full fine‑tuning learns an update ΔW of the same size. LoRA constrains the update to have low rank r ≪ min(dout, din):

ΔW = BA, where B ∈ ℝ^(dout×r) and A ∈ ℝ^(r×din).

The adapted weight is then

W' = W + (α/r)·BA,

where α is a scaling hyperparameter that controls the magnitude of the low‑rank update. In most implementations, W is frozen and only A and B are trained.
Full update: dout·din parameters. LoRA update: r·(din + dout) parameters (for A and B). For square layers with d = din = dout:

Params(full) = d², Params(LoRA) = 2·d·r, a reduction factor of d²/(2·d·r) = d/(2r).
Example: d = 4096, r = 8 → Params(full) = 16,777,216 vs. Params(LoRA) = 65,536 (≈256× fewer). Optimizer memory also drops proportionally because moments are maintained only for A and B.
For input x ∈ ℝ^(din), the layer computes

y = Wx + (α/r)·B(Ax).
The LoRA branch is a small bottleneck MLP (A then B) added residually to the frozen linear layer.
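A minimal numeric check of this residual form and of the merged form W + (α/r)·BA; the shapes and the α/r value are illustrative only:

import torch

dout, din, r, alpha = 6, 4, 2, 16
scaling = alpha / r
W = torch.randn(dout, din)                 # frozen base weight
A = torch.randn(r, din)                    # down-projection (trainable in LoRA)
B = torch.randn(dout, r)                   # up-projection (trainable in LoRA)
x = torch.randn(din)

y_branch = W @ x + scaling * (B @ (A @ x)) # adapter applied as a residual branch
y_merged = (W + scaling * (B @ A)) @ x     # adapter folded into the weight
assert torch.allclose(y_branch, y_merged, atol=1e-5)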
In Transformers, LoRA is commonly applied to the attention projections (the query, key, value, and output matrices Wq, Wk, Wv, Wo), where a large share of the parameters live and adaptation is impactful.
Adapters are trained per target task/domain; the base model remains unchanged.
| Aspect | Full Fine‑Tuning | LoRA |
|---|---|---|
| Trainable params per d×d layer | d² | 2 d r |
| Optimizer states | Moments for all d² weights | Moments only for A and B |
| Extra FLOPs per forward | — | O(d·r) for Ax plus O(d·r) for B(Ax) |
| Inference overhead | — | Zero after merge: W ← W + (α/r) · BA |
Since r ≪ d, compute overhead is negligible. During inference, merge the adapter into W to avoid additional matmuls.
import math
import torch
import torch.nn as nn
class LoRALinear(nn.Module):
def __init__(self, in_features, out_features, r=8, alpha=16, dropout=0.0, bias=True):
super().__init__()
self.in_features = in_features
self.out_features = out_features
self.r = r
self.alpha = alpha
self.scaling = alpha / r if r > 0 else 1.0
# Frozen base weight
self.weight = nn.Parameter(torch.empty(out_features, in_features))
nn.init.kaiming_uniform_(self.weight, a=math.sqrt(5))
self.weight.requires_grad = False
self.bias = nn.Parameter(torch.zeros(out_features)) if bias else None
# Trainable low-rank factors (A: r×in, B: out×r)
if r > 0:
self.A = nn.Parameter(torch.zeros(r, in_features))
self.B = nn.Parameter(torch.zeros(out_features, r))
nn.init.kaiming_uniform_(self.A, a=math.sqrt(5))
nn.init.zeros_(self.B) # start near zero so W' ≈ W
else:
self.register_parameter('A', None)
self.register_parameter('B', None)
self.dropout = nn.Dropout(dropout)
def forward(self, x):
# Base
y = x @ self.weight.T
if self.bias is not None:
y = y + self.bias
# Low-rank residual
if self.r > 0:
y = y + self.scaling * (self.dropout(x) @ self.A.T @ self.B.T)
return y
@torch.no_grad()
def merge_adapter_(self):
if self.r > 0:
self.weight += self.scaling * (self.B @ self.A)
# After merging, you may set r=0 to disable adapter branch
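A quick usage check of the class above; the layer size and rank are illustrative, and B is filled with random values to stand in for a trained adapter (it is zero at initialization):

layer = LoRALinear(4096, 4096, r=8, alpha=16, bias=False)
with torch.no_grad():
    layer.B.normal_()                      # pretend the adapter has been trained

x = torch.randn(2, 4096)
y_branch = layer(x)                        # adapter applied as a residual branch

trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)                           # 65,536 = 2·d·r, matching the count above

layer.merge_adapter_()                     # W ← W + (α/r)·BA
layer.r = 0                                # disable the branch; W alone now carries the update
y_merged = layer(x)
assert torch.allclose(y_branch, y_merged, atol=1e-3)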
In practice, LoRA is wrapped around existing linear layers (e.g., attention projections). Many libraries implement injection utilities to replace layers without editing model code.
| Method | Idea | Trainable Params | Notes |
|---|---|---|---|
| LoRA | Low‑rank residual update B A | Low (∝ r) | Mergeable; good quality/efficiency trade‑off |
| Adapters | Small MLP blocks in residuals | Low–Medium | Heavier at inference unless merged |
| Prefix/Prompt Tuning | Learned prompts/key‑values | Very Low | Strong for generation; may underperform on some tasks |
LoRA optimizes the loss L(W') over the low‑rank factors, with W' = W + (α/r)·BA. With W frozen, the gradients factor cleanly:

∂L/∂B = (α/r)·(∂L/∂W')·Aᵀ,  ∂L/∂A = (α/r)·Bᵀ·(∂L/∂W').
Zero‑initializing B makes W' ≈ W at step 0, stabilizing early training. The α scale balances update magnitude. Combine with gradient clipping (e.g., 0.5–1.0) for stability on long sequences.
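The factorized gradients above can be sanity‑checked against autograd; the quadratic loss in this sketch is arbitrary and used only for illustration:

import torch

dout, din, r, alpha = 5, 7, 3, 6
scaling = alpha / r
W = torch.randn(dout, din)                 # frozen
A = torch.randn(r, din, requires_grad=True)
B = torch.randn(dout, r, requires_grad=True)

W_prime = W + scaling * (B @ A)
loss = (W_prime ** 2).sum()                # arbitrary scalar loss L(W')
loss.backward()

dL_dWp = 2 * W_prime.detach()              # ∂L/∂W' for this particular loss
assert torch.allclose(B.grad, scaling * (dL_dWp @ A.detach().T), rtol=1e-4, atol=1e-4)
assert torch.allclose(A.grad, scaling * (B.detach().T @ dL_dWp), rtol=1e-4, atol=1e-4)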
If a full fine‑tune produces ΔW*, the best rank‑r approximation in Frobenius norm is the truncated SVD (Eckart–Young). This motivates LoRA's parameterization:

ΔW* ≈ UᵣΣᵣVᵣᵀ, so one can take B = UᵣΣᵣ and A = Vᵣᵀ (or split Σᵣ between the two factors).
Practical corollary: compress a full fine‑tune by SVD‑factoring ΔW* into B A with rank r, or initialize LoRA from an SVD of a few saved full‑tune checkpoints.
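A sketch of that corollary; the random matrix below is only a stand‑in for a real ΔW* saved from a full fine‑tune:

import torch

dout, din, r = 512, 512, 8
delta_W = torch.randn(dout, din)           # stand-in for a real full fine-tune update ΔW*

# Truncated SVD gives the best rank-r approximation in Frobenius norm
U, S, Vh = torch.linalg.svd(delta_W, full_matrices=False)
B = U[:, :r] * S[:r]                       # absorb singular values into B: dout × r
A = Vh[:r, :]                              # r × din

rel_err = torch.linalg.norm(delta_W - B @ A) / torch.linalg.norm(delta_W)
print(f"rank-{r} relative Frobenius error: {rel_err:.3f}")
# A real ΔW* typically has more low-rank structure than this random stand-in, so its error is smaller.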
Attention projections are often packed as a single matrix Wqkv ∈ ℝ^(3·dmodel×dmodel). You can attach one adapter to the packed matrix, or split the adapter by projection or by head:
# Inject LoRA into selected Linear layers (e.g., qkv, proj) of an attention block
def inject_lora(module, targets, r=8, alpha=16):
    for name in targets:                            # e.g., ["qkv", "proj"]
        base = getattr(module, name)
        lora = LoRALinear(base.in_features, base.out_features,
                          r=r, alpha=alpha, bias=(base.bias is not None))
        with torch.no_grad():
            lora.weight.copy_(base.weight)          # carry over the frozen base weight
            if base.bias is not None:
                lora.bias.copy_(base.bias)
        setattr(module, name, lora)                 # swap in the adapter-wrapped layer

# Example: inject_lora(self.attn, ["qkv", "proj"], r=8, alpha=16)
QLoRA adds adapters on top of 4‑bit quantized base weights to minimize memory without sacrificing much quality. The base W is quantized (e.g., NF4) and dequantized during forward passes; A and B remain trainable at higher precision.
NF4 (normal‑float 4‑bit) quantization, double quantization of the quantization scales, and paged optimizers are commonly used together. This makes it feasible to fine‑tune 7B+‑parameter models on a single high‑memory GPU.
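A minimal sketch of this setup with the Hugging Face transformers, peft, and bitsandbytes stack; the model id, rank, and target module names are illustrative assumptions, not prescriptions:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization of the frozen base weights, with double quantization of the scales
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",            # illustrative model id
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# Higher-precision LoRA adapters on top of the quantized base
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()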
Train adapters for different domains or tasks and combine them at inference, e.g., as a weighted sum of their updates:

W' = W + Σᵢ wᵢ·(α/r)·BᵢAᵢ, where the wᵢ are blend weights.
Merge pre‑blended adapters into W for zero‑overhead deployment.
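A sketch of blending, assuming two adapters trained separately for the same LoRALinear layer with the same α/r; the blend weights are a modeling choice:

import torch

@torch.no_grad()
def merge_blended_(layer, adapters, weights):
    # layer:    a LoRALinear (uses its frozen .weight and .scaling)
    # adapters: list of (A, B) pairs with A: r×din, B: dout×r
    # weights:  per-adapter blend coefficients, e.g. [0.7, 0.3]
    for (A, B), w in zip(adapters, weights):
        layer.weight += w * layer.scaling * (B @ A)

# Hypothetical usage with a "medical" and a "legal" adapter:
# merge_blended_(layer, [(A_med, B_med), (A_law, B_law)], weights=[0.7, 0.3])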
Putting it together, a minimal training and deployment loop (assuming the model's target linear layers have already been replaced by LoRALinear):

# Training (LoRA): freeze the base model, train only the A and B factors
adapter_params = [p for m in model.modules() if isinstance(m, LoRALinear)
                  for p in (m.A, m.B) if p is not None]
for p in model.parameters():
    p.requires_grad_(False)
for p in adapter_params:
    p.requires_grad_(True)
opt = torch.optim.AdamW(adapter_params, lr=1e-4)   # illustrative hyperparameters

for x, y in data:
    logits = model(x)                              # forward uses W' = W + (α/r)·BA
    loss = loss_fn(logits, y)
    loss.backward()                                # gradients accumulate only for A, B
    torch.nn.utils.clip_grad_norm_(adapter_params, 1.0)
    opt.step()
    opt.zero_grad()

# Deployment: fold each adapter into its base weight, then drop the branch
with torch.no_grad():
    for m in model.modules():
        if isinstance(m, LoRALinear):
            m.merge_adapter_()                     # W ← W + (α/r)·BA
            m.r = 0                                # forward now skips the adapter branch
            m.A = None                             # drop the factors; keep W only (zero overhead)
            m.B = None