LoRA: Low-Rank Adaptation of Large Language Models


The authors propose Low-Rank Adaptation (LoRA), which freezes the pretrained model weights and injects trainable rank-decomposition matrices.
$W_0 + \Delta W = W_0 + BA$
LoRA reduces the number of trainable parameters and improves efficiency, with no critical drop in accuracy and no additional inference latency.
GPT-3 175B: ~10,000× fewer trainable parameters, ~3× less GPU memory.
LoRA is task- and model-agnostic (it can be applied to many fine-tuning regimes).
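The update rule above can be sketched as a wrapper around a frozen linear layer. This is a minimal pure-Python illustration, not the paper's implementation; the shapes, the `alpha` scaling factor, and the helper names are my assumptions. Note that $B$ is initialized to zero, so the adapted layer starts out computing exactly the pretrained output.

```python
# Minimal LoRA forward-pass sketch (pure Python, hypothetical shapes).
# W0 is the frozen pretrained weight; only A and B are trainable.

def matmul(M, N):
    """Multiply two matrices given as lists of rows."""
    return [[sum(M[i][k] * N[k][j] for k in range(len(N)))
             for j in range(len(N[0]))] for i in range(len(M))]

def lora_forward(x, W0, A, B, alpha, r):
    """h = W0 x + (alpha / r) * B A x, where x is a column vector."""
    base = matmul(W0, x)                 # frozen pretrained path
    delta = matmul(B, matmul(A, x))      # low-rank update path
    scale = alpha / r
    return [[base[i][0] + scale * delta[i][0]] for i in range(len(base))]

# Toy example: d_out = d_in = 2, rank r = 1.
W0 = [[1.0, 0.0],
      [0.0, 1.0]]            # frozen pretrained weight (identity here)
A  = [[0.5, 0.5]]            # r x d_in, Gaussian-initialized in practice
B  = [[0.0], [0.0]]          # d_out x r, zero-initialized
x  = [[2.0], [4.0]]

h = lora_forward(x, W0, A, B, alpha=1.0, r=1)
print(h)  # B = 0, so this equals W0 @ x: [[2.0], [4.0]]
```

Because the extra path is additive, $BA$ can be merged into $W_0$ after training, which is why LoRA adds no inference latency.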

Related works

(Typical) Adapter layers

Most adapter designs introduce extra inference latency.
Houlsby et al. (2019) - two adapter layers
Lin et al. (2020)
Since the batch size in most online inference settings is 1, adapters induce a noticeable increase in inference time.

Low-rank structures in Deep Learning

Low-rank structure:
The property that the tensors of the neural network (usually weight matrices) can be approximated or decomposed into a combination of low-rank matrices or tensors.
Has been observed across various deep learning tasks (especially in over-parametrized neural networks).


Full fine-tuning:
$\max_{\Phi} \sum_{(x, y) \in \mathcal{Z}} \sum_{t=1}^{|y|} \log P_{\Phi}(y_t \mid x, y_{<t})$
The model is initialized to pre-trained weights $\Phi_0$
and updated to $\Phi_0 + \Delta\Phi$.
Extremely compute-intensive for large models (e.g., GPT-3 with 175 billion parameters).


Low-rank Adaptation
$W_0 + \Delta W = W_0 + BA$
$\max_{\Theta} \sum_{(x, y) \in \mathcal{Z}} \sum_{t=1}^{|y|} \log p_{\Phi_0 + \Delta\Phi(\Theta)}(y_t \mid x, y_{<t})$
The task-specific parameter increment $\Delta\Phi = \Delta\Phi(\Theta)$ is encoded by a much smaller set of parameters $\Theta$, with $|\Theta| \ll |\Phi_0|$. → Compute- and memory-efficient!
Applying to Transformer
LoRA is applied only to the self-attention weight matrices: $W_q, W_k, W_v, W_o$.
Not applied to the MLP modules.
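The parameter savings can be checked with back-of-the-envelope arithmetic. The settings below are my assumptions for a GPT-3-scale model ($d_{model} = 12288$, 96 layers, LoRA with $r = 4$ on $W_q$ and $W_v$ only, one commonly used configuration), not numbers quoted from the paper's tables.

```python
# Hypothetical GPT-3-scale configuration.
d_model, n_layers, r = 12288, 96, 4

# Each adapted d x d weight gets B (d x r) and A (r x d): 2*d*r params.
per_matrix = 2 * d_model * r
trainable = per_matrix * 2 * n_layers   # W_q and W_v in every layer

full = 175_000_000_000                  # total GPT-3 parameters
print(trainable)                        # 18874368 (~19M trainable params)
print(full // trainable)                # 9271, roughly the 1/10,000 cited above
```

Since trainable parameters dominate optimizer-state memory (e.g., Adam's moment buffers), this reduction is also what drives the GPU memory savings.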


Further analysis

What is the optimal rank $r$ for LoRA?

LoRA performs well even with a very small $r$.
Increasing $r$ does not cover a more meaningful subspace, suggesting the adaptation matrix has a low "intrinsic rank".

Subspace similarity between different values of $r$

The directions of the top singular vectors overlap significantly across different $r$, while the remaining directions do not.
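The overlap can be quantified with a normalized subspace similarity based on the Grassmann distance, of the form the paper uses (comparing, e.g., the adaptation matrices learned with $r=8$ and $r=64$):

```latex
\phi(A_{r=8}, A_{r=64}, i, j)
  = \frac{\left\lVert U_{A_{r=8}}^{i\top}\, U_{A_{r=64}}^{j} \right\rVert_F^2}{\min(i, j)}
  \in [0, 1]
```

Here $U_A^i$ denotes the first $i$ columns of left singular vectors of $A$; $\phi$ close to 1 means the smaller subspace lies almost entirely inside the larger one, which is what is observed for the top directions.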

Comparing adaptation matrix ΔW\Delta W to WW

Personal Opinion