LoRA: Low-Rank Adaptation of Large Language Models

Type

Technical

Published year

2021

URL

https://arxiv.org/abs/2106.09685

Journal / Conference

ArXiv

Keyword

Adaptation

Status

Done

Language

🇺🇸

Blog upload

Yes

Date

2023/06/15


Summary

Related works

(Typical) Adapter layers

Low-rank structures in Deep Learning

Optimal rank r for LoRA?

Subspace similarity between different r

Comparing adaptation matrix \Delta W to W

Personal Opinion

Reference

Summary

•

Authors propose Low-Rank Adaptation (LoRA) that freezes the pretrained model weights and injects trainable rank decomposition matrices.

W_0 + \Delta W = W_0 + BA

•

LoRA reduces the number of trainable parameters and improves training efficiency, with no critical drop in accuracy and no additional inference latency.

◦

GPT-3 175B: about 10,000 times fewer trainable parameters and about 3 times less GPU memory than full fine-tuning

•

LoRA is task- and model-agnostic. (It can be applied to many fine-tuning regimes.)
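
To make the parameter saving concrete, here is a minimal sketch that counts trainable parameters for a single weight matrix, comparing a directly trained \Delta W with the factors B and A. The function name `lora_param_counts` and the rank r = 4 are illustrative assumptions; d = k = 12288 corresponds to the hidden size of GPT-3 175B. The per-matrix ratio it prints is not the model-wide 10,000x figure, which also depends on which matrices are adapted.

```python
def lora_param_counts(d: int, k: int, r: int) -> tuple[int, int]:
    """Return (full, lora) trainable-parameter counts for one d x k weight matrix."""
    full = d * k        # Delta W trained directly: d x k entries
    lora = r * (d + k)  # B is d x r and A is r x k
    return full, lora


# Illustrative values: one square projection matrix, rank 4 (assumed).
full, lora = lora_param_counts(d=12288, k=12288, r=4)
print(f"full: {full:,}  lora: {lora:,}  ratio: {full // lora}x")
```

For a square matrix the ratio simplifies to d / (2r), so smaller ranks give proportionally larger savings.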

Related works

(Typical) Adapter layers

•

Most of the adapter layers introduce inference latency.

◦

Houlsby et al. (2019) - two adapter layers

◦

Lin et al. (2020)

•

Since the batch size in most online inference settings is 1, there is little parallelism to hide the extra sequential computation, so this induces a noticeable increase in inference time.

Low-rank structures in Deep Learning

•

Low-rank structure:

◦

The property that the tensors of the neural network (usually weight matrices) can be approximated or decomposed into a combination of low-rank matrices or tensors.

◦

Has been observed in various deep learning tasks (especially in over-parametrized neural networks)
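
As a small illustration of what "approximated by low-rank matrices" means, the sketch below builds the best rank-r approximation of a matrix with a truncated SVD. The helper name `truncated_svd` and the synthetic matrix (a rank-4 signal plus small noise) are assumptions made for this example, not something taken from the paper.

```python
import numpy as np


def truncated_svd(W: np.ndarray, r: int) -> np.ndarray:
    """Best rank-r approximation of W in Frobenius norm (Eckart-Young theorem)."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    # Keep only the r largest singular values and their singular vectors.
    return (U[:, :r] * S[:r]) @ Vt[:r, :]


rng = np.random.default_rng(0)
# Synthetic 64 x 32 matrix: rank-4 signal plus small Gaussian noise (assumed setup).
W = rng.normal(size=(64, 4)) @ rng.normal(size=(4, 32))
W = W + 0.01 * rng.normal(size=(64, 32))

W4 = truncated_svd(W, r=4)
rel_err = np.linalg.norm(W - W4) / np.linalg.norm(W)
print(np.linalg.matrix_rank(W4), rel_err)
```

When a matrix has this kind of structure, the relative error stays small even though the rank-4 factors hold far fewer numbers than the full matrix.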

Fine-tuning

•

Full fine-tuning:

\max_{\Phi} \sum_{(x, y) \in \mathcal{Z}}\sum_{t=1}^{|y|} \log (P_{\Phi}(y_t | x, y_{<t}))

◦

The model is initialized to the pre-trained weights $\Phi_0$.

◦

And updated to $\Phi_0 + \Delta\Phi$

◦

Extremely compute-intensive for large models (such as GPT-3 with 175 billion parameters), because $\Delta\Phi$ has as many parameters as $\Phi_0$

LoRA

•

Low-rank Adaptation

W_0 + \Delta W = W_0 + BA

\max_{\Theta} \sum_{(x, y) \in \mathcal{Z}}\sum_{t=1}^{|y|} \log (p_{\Phi_0 + \Delta \Phi (\Theta)}(y_t|x, y_{<t}))

◦

The task-specific parameter increment $\Delta\Phi = \Delta\Phi(\Theta)$ is encoded by a much smaller set of parameters $\Theta$, with $|\Theta| \ll |\Phi_0|$. → Compute- and memory-efficient!

•

Applying to Transformer

◦

LoRA is applied only to the self-attention module weights: $W_q, W_k, W_v, W_o$

▪

Not applied to MLP module
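
The update W_0 + BA can be sketched as a single linear layer in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `LoRALinear`, the 0.01 standard deviation for A, and the random stand-in for a trained B are all hypothetical. The Gaussian initialization of A, the zero initialization of B, and the \alpha / r scaling follow the paper's description.

```python
import numpy as np


class LoRALinear:
    """Frozen weight W0 (d x k) plus a trainable low-rank update B @ A."""

    def __init__(self, W0: np.ndarray, r: int, alpha: float = 1.0, seed: int = 0):
        d, k = W0.shape
        rng = np.random.default_rng(seed)
        self.W0 = W0                                   # frozen pre-trained weight
        self.A = rng.normal(scale=0.01, size=(r, k))   # trainable; std 0.01 is an assumption
        self.B = np.zeros((d, r))                      # trainable; zero so B @ A = 0 at start
        self.scale = alpha / r

    def forward(self, x: np.ndarray) -> np.ndarray:
        # h = W0 x + (alpha / r) * B (A x), without forming the d x k product B @ A
        return self.W0 @ x + self.scale * (self.B @ (self.A @ x))

    def merged_weight(self) -> np.ndarray:
        # Fold the update into one matrix, so inference costs a single matmul
        return self.W0 + self.scale * (self.B @ self.A)


rng = np.random.default_rng(1)
layer = LoRALinear(rng.normal(size=(8, 6)), r=2, alpha=2.0)
x = rng.normal(size=6)

h_init = layer.forward(x)            # B is zero, so this equals W0 @ x
layer.B = rng.normal(size=(8, 2))    # hypothetical stand-in for a trained B
h = layer.forward(x)
h_merged = layer.merged_weight() @ x
print(np.allclose(h, h_merged))
```

Because the merged matrix has the same shape as W_0, the adapted layer can be deployed like the original one, which is why no extra inference latency is introduced.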

Results

Further analysis

Optimal rank $r$ for LoRA?

•

LoRA performs well even with a very small $r$

•

Increasing $r$ does not cover a more meaningful subspace

Subspace similarity between different $r$

•

The directions of the top singular vectors overlap significantly between different $r$, while the other directions do not.
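
The overlap is measured in the paper with a normalized subspace similarity, \phi(A_{r_1}, A_{r_2}, i, j) = \|U_{A_{r_1}}^{i\top} U_{A_{r_2}}^{j}\|_F^2 / \min(i, j), where U^i holds the top-i right-singular vectors. The sketch below computes it with NumPy. The function name `subspace_similarity` and the synthetic matrices (two random matrices that share one dominant direction `v`) are assumptions for illustration, not the learned adapters from the paper.

```python
import numpy as np


def subspace_similarity(A1: np.ndarray, A2: np.ndarray, i: int, j: int) -> float:
    """Overlap in [0, 1] between the top-i right-singular directions of A1
    and the top-j right-singular directions of A2."""
    _, _, Vt1 = np.linalg.svd(A1, full_matrices=False)
    _, _, Vt2 = np.linalg.svd(A2, full_matrices=False)
    U1 = Vt1[:i].T  # k x i, columns are right-singular vectors of A1
    U2 = Vt2[:j].T  # k x j, columns are right-singular vectors of A2
    return float(np.linalg.norm(U1.T @ U2, "fro") ** 2 / min(i, j))


rng = np.random.default_rng(0)
k = 128
v = rng.normal(size=k)
v /= np.linalg.norm(v)  # shared dominant direction (synthetic assumption)

# Stand-ins for A at r = 8 and r = 64: strong shared direction plus noise.
A8 = 10.0 * np.outer(rng.normal(size=8), v) + 0.1 * rng.normal(size=(8, k))
A64 = 10.0 * np.outer(rng.normal(size=64), v) + 0.1 * rng.normal(size=(64, k))

top1 = subspace_similarity(A8, A64, 1, 1)
top8 = subspace_similarity(A8, A64, 8, 8)
print(top1, top8)
```

A value near 1 means the two subspaces nearly coincide and a value near 0 means they are almost orthogonal. In this synthetic setup only the top direction is shared, so the similarity is high for i = j = 1 and lower once the noisy directions are included.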
