## Summary

- The authors propose Low-Rank Adaptation (LoRA), which freezes the pretrained model weights and injects trainable rank decomposition matrices (a minimal sketch follows this list):

  $W_0 + \Delta W = W_0 + BA$, where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and the rank $r \ll \min(d, k)$

- LoRA reduces the number of trainable parameters and improves efficiency, with no critical drop in accuracy and no additional inference latency.
  - GPT-3 175B: ~10,000× fewer trainable parameters and ~3× less GPU memory

- LoRA is task- and model-agnostic (it can be applied to many fine-tuning regimes).
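To make the reparametrization concrete, below is a minimal PyTorch sketch of a LoRA-augmented linear layer. It is an illustration, not the authors' reference implementation; the class name, the `alpha`/`r` scaling convention, and the initialization (Gaussian $A$, zero $B$ so that $\Delta W = BA$ starts at zero) follow common LoRA practice.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen pretrained linear layer plus a trainable low-rank update BA."""

    def __init__(self, d_in: int, d_out: int, r: int = 8, alpha: float = 16.0):
        super().__init__()
        # W0: the pretrained weight, frozen (in practice loaded from a checkpoint).
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad_(False)
        # Trainable low-rank factors: A is (r, d_in), B is (d_out, r).
        self.lora_A = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(d_out, r))  # BA = 0 at init
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # W0 x + (BA) x, without ever materializing the full d_out x d_in delta.
        update = (x @ self.lora_A.T) @ self.lora_B.T
        return self.base(x) + self.scaling * update
```

Only `lora_A` and `lora_B` require gradients here, so an optimizer built over `filter(lambda p: p.requires_grad, model.parameters())` trains just the low-rank factors.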

### Related work

#### (Typical) Adapter layers

- Most adapter layers introduce inference latency, because the adapter computation runs in sequence with the frozen layers (a sketch follows this list).
  - Houlsby et al. (2019): two adapter layers per Transformer block
  - Lin et al. (2020)
- Since the batch size in most online-inference settings is 1, the extra sequential computation induces a noticeable increase in inference time.
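For contrast with LoRA, here is a minimal sketch of a Houlsby-style bottleneck adapter; the module names, bottleneck width, and GELU nonlinearity are illustrative choices, not a faithful reproduction. Because the adapter sits in series with the sublayer output, its two extra matrix multiplies stay on the critical path at inference time and, unlike LoRA's $BA$, cannot be folded into the pretrained weights.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Houlsby-style adapter: down-project, nonlinearity, up-project, residual."""

    def __init__(self, d_model: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, d_model)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h is the frozen sublayer's output; the adapter runs *after* it,
        # so these two matmuls add depth (and latency) on every forward pass.
        return h + self.up(self.act(self.down(h)))
```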

#### Low-rank structures in Deep Learning

- Low-rank structure:
  - The property that the tensors of a neural network (usually weight matrices) can be approximated by, or decomposed into, a combination of low-rank matrices or tensors.
  - It has been observed in various deep learning tasks, especially in over-parametrized neural networks (a truncated-SVD check is sketched below).
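A standard way to probe low-rank structure is a truncated SVD: by the Eckart-Young theorem, keeping the top $r$ singular triples gives the best rank-$r$ approximation in the Frobenius norm. The sketch below uses a random matrix purely as a stand-in; a random Gaussian matrix is essentially full-rank, so the point of the low-rank observation is that trained weights (and especially weight updates) behave far better than this baseline.

```python
import torch

def truncated_svd_approx(W: torch.Tensor, r: int) -> torch.Tensor:
    """Best rank-r approximation of W in the Frobenius norm (Eckart-Young)."""
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    return U[:, :r] @ torch.diag(S[:r]) @ Vh[:r, :]

# Stand-in matrix; substitute a trained weight (or weight update) to probe it.
W = torch.randn(768, 768)
W_r = truncated_svd_approx(W, r=8)
rel_err = torch.linalg.norm(W - W_r) / torch.linalg.norm(W)
print(f"relative Frobenius error at r=8: {rel_err:.3f}")
```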

#### Fine-tuning

- Full fine-tuning (a one-step sketch follows this list):

  $\max_{\Phi} \sum_{(x, y) \in \mathcal{Z}}\sum_{t=1}^{|y|} \log P_{\Phi}(y_t \mid x, y_{<t})$

  - The model is initialized to the pre-trained weights $\Phi_0$.
  - It is then updated to $\Phi_0 + \Delta\Phi$, where the increment $\Delta\Phi$ has the same dimension as $\Phi_0$.
  - This is extremely compute-intensive for large models (such as GPT-3 with 175 billion parameters).
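As a concrete (if toy-scale) illustration of this objective, here is one full fine-tuning step with Hugging Face `transformers`; the `gpt2` checkpoint, learning rate, and toy batch are placeholders. The key point is that the optimizer holds state for every parameter in $\Phi$, which is what makes this prohibitive at 175B scale.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Toy-scale stand-in; the paper's setting is GPT-3, far beyond this.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Every parameter of Phi gets a gradient plus AdamW state (m, v): roughly
# three extra tensors per weight, which dominates memory at large scale.
opt = torch.optim.AdamW(model.parameters(), lr=5e-5)

batch = tok("a context x followed by its target y", return_tensors="pt")
# With labels=input_ids, the model returns the (mean) token-level negative
# log-likelihood, i.e. the inner sum of the objective above with a sign flip.
loss = model(**batch, labels=batch["input_ids"]).loss
loss.backward()
opt.step()
```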

### LoRA

- Low-Rank Adaptation:

  $W_0 + \Delta W = W_0 + BA$

  $\max_{\Theta} \sum_{(x, y) \in \mathcal{Z}}\sum_{t=1}^{|y|} \log (p_{\Phi_0 + \Delta \Phi (\Theta)}(y_t|x, y_{<t}))$

  - The task-specific parameter increment $\Delta\Phi = \Delta\Phi(\Theta)$ is encoded by a much smaller set of parameters $\Theta$, with $|\Theta| \ll |\Phi_0|$. → Compute- and memory-efficient!

- Applying LoRA to the Transformer (a weight-merging sketch follows this list):
  - LoRA is applied only to the self-attention weight matrices $W_q, W_k, W_v, W_o$.
    - It is not applied to the MLP modules.
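The reason LoRA adds no inference latency is that the update can be folded into the frozen weight before deployment; below is a minimal sketch of that merge (the shapes and the `scaling` factor are illustrative).

```python
import torch

def merge_lora(W0: torch.Tensor, B: torch.Tensor, A: torch.Tensor,
               scaling: float = 1.0) -> torch.Tensor:
    """Fold the low-rank update into the frozen weight for deployment.

    After merging, the layer is an ordinary dense matmul with W = W0 + BA,
    so serving costs exactly what the pretrained model costs. Switching
    tasks just means subtracting one BA and adding another.
    """
    return W0 + scaling * (B @ A)

# Example shapes for a d_model = 768 attention projection adapted at r = 8.
d, k, r = 768, 768, 8
W0 = torch.randn(d, k)
B, A = torch.zeros(d, r), torch.randn(r, k)  # B = 0, as at LoRA init
W = merge_lora(W0, B, A)
assert torch.equal(W, W0)  # BA = 0 at init, so the merged weight is just W0
```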

### Results

### Further analysis

#### Optimal rank $r$ for LoRA?

- LoRA performs well even with a very small $r$ (single-digit ranks suffice in the paper's experiments).
- Increasing $r$ does not cover a more meaningful subspace, suggesting that a low-rank adaptation matrix is sufficient.

#### Subspace similarity between different $r$

- The directions of the top singular vectors overlap significantly across different choices of $r$, while the remaining directions do not (a similarity sketch follows).
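The paper quantifies this with a normalized subspace similarity, $\phi(A_{r_1}, A_{r_2}, i, j) = \lVert U^{i\top}_{A_{r_1}} U^{j}_{A_{r_2}} \rVert_F^2 / \min(i, j) \in [0, 1]$, computed from the singular vectors of the learned $A$ matrices. Below is a small sketch of that measure; the random matrices are stand-ins for learned adapters (for random inputs $\phi$ is near zero, whereas the paper observes high overlap in the top directions of actually-learned $A$'s).

```python
import torch

def subspace_similarity(A1: torch.Tensor, A2: torch.Tensor,
                        i: int, j: int) -> float:
    """phi(A1, A2, i, j) = ||V1[:i] @ V2[:j].T||_F^2 / min(i, j), in [0, 1].

    Uses right-singular vectors: for A of shape (r, d) they span subspaces
    of the shared d-dimensional input space, so adapters trained with
    different ranks r remain directly comparable.
    """
    Vh1 = torch.linalg.svd(A1, full_matrices=False).Vh  # rows: singular dirs
    Vh2 = torch.linalg.svd(A2, full_matrices=False).Vh
    overlap = Vh1[:i, :] @ Vh2[:j, :].T                 # (i, j) inner products
    return (torch.linalg.norm(overlap) ** 2 / min(i, j)).item()

# Random stand-ins for A learned at r = 8 and r = 64 (d = 768).
A_r8, A_r64 = torch.randn(8, 768), torch.randn(64, 768)
print(subspace_similarity(A_r8, A_r64, i=1, j=1))  # near 0 for random A's
```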