MAML: Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks

Type

Technical

Published year

2017

URL

https://arxiv.org/abs/1703.03400

Journal / Conference

ICML

Keyword

Meta learning

Status

Done

Language

🇺🇸

Blog upload

Yes

Date

2021/09/05


ICML, ‘17

https://arxiv.org/abs/1703.03400

Summary

•

MAML is a general, model-agnostic algorithm that can be applied directly to any model trained with gradient descent.

•

MAML does not expand the number of learned parameters.

•

MAML does not place constraints on the model architecture.

Key words

•

Model agnostic

•

Fast adaptation

•

Optimization based approach

•

Learning good model parameters

Preliminaries

Common Approaches of Meta-Learning and MAML

Some terminology for meta-learning problems

1. Introduction

Goal of an ideal artificial agent:

Learning and adapting quickly from only a few examples.

To do so, an agent must:

•

Integrate its prior experience with a small amount of new information.

•

Avoid overfitting to the new data.

→ Meta-learning has the same goals.

MAML:

"The key idea of MAML is to train the model's initial parameters such that the model has maximal performance on a new task after the parameters have been updated through one or more gradient steps computed with a small amount of data from that new task."

Learning process of MAML:

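MAML maximizes the sensitivity of the loss functions of new tasks with respect to the parameters, so that small parameter changes lead to large improvements in the task loss.

The authors demonstrated the algorithm on three problem domains: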

•

Few-shot regression

•

Image classification

•

Reinforcement learning

2. Model-Agnostic Meta Learning

2.1. Meta-Learning Problem Set-Up

To apply MAML to a variety of learning problems, the authors introduce a generic notion of a learning task:

\mathcal{T} = \{ \mathcal{L}(\mathbf{x}_1, \mathbf{a}_1, ..., \mathbf{x}_H, \mathbf{a}_H), q(\mathbf{x}_1), q(\mathbf{x}_{t+1}|\mathbf{x}_t, \mathbf{a}_t), H \}

Each task

\mathcal{T}

consists of:

\mathcal{L}

: a loss function, e.g. a misclassification loss or a cost function in a Markov decision process

q(\mathbf{x}_1)

: a distribution over initial observations

q(\mathbf{x}_{t+1}|\mathbf{x}_t , \mathbf{a}_t)

: a transition distribution

H

: an episode length; in i.i.d. supervised learning problems the length is

H = 1

The authors consider a distribution over tasks

p(\mathcal{T})

that the model should be able to adapt to.
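As an illustration of the task tuple and of sampling from p(\mathcal{T}), the sketch below encodes a supervised task (H = 1, so no transition distribution is needed) as a small data structure. The names `Task`, `sample_task` and `draw_samples` are hypothetical. The sinusoid family, with amplitude in [0.1, 5.0], phase in [0, π] and inputs in [-5, 5], follows the paper's regression experiment; the exact functional form is an assumption made for the example.

```python
import math
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Task:
    """One supervised task T = {L, q(x_1), H} with H = 1 (hypothetical structure)."""
    loss: Callable[[List[float], List[float]], float]  # L: (predictions, targets) -> scalar
    sample_input: Callable[[], float]                  # draws x ~ q(x_1)
    target: Callable[[float], float]                   # ground-truth y for an input x
    horizon: int = 1                                   # H = 1: no transition distribution


def squared_error(preds, targets):
    """Sum of squared errors over the given samples."""
    return sum((p - y) ** 2 for p, y in zip(preds, targets))


def sample_task(rng: random.Random) -> Task:
    """Draw T_i ~ p(T): a sinusoid with a random amplitude and phase."""
    amplitude = rng.uniform(0.1, 5.0)
    phase = rng.uniform(0.0, math.pi)
    return Task(
        loss=squared_error,
        sample_input=lambda: rng.uniform(-5.0, 5.0),
        target=lambda x: amplitude * math.sin(x - phase),  # assumed functional form
    )


def draw_samples(task: Task, k: int):
    """Draw K input/target pairs from one task."""
    xs = [task.sample_input() for _ in range(k)]
    ys = [task.target(x) for x in xs]
    return xs, ys
```

Calling `draw_samples(sample_task(random.Random(0)), 5)` would give one 5-shot data set; every call to `sample_task` produces a different regression problem from the same family.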

Meta-training:

A new task

\mathcal{T}_i

is sampled from

p(\mathcal{T})

The model is trained with only

K

samples drawn from

q_i

Loss

\mathcal{L}_{\mathcal{T}_i}

is computed and fed back to the model.

The model

f

is tested on new samples from

\mathcal{T}_i

The model

f

is then improved by considering how the

test

error on new data from

q_i

changes with respect to the parameters.
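The last step, improving f by looking at how the test error changes with respect to the parameters, can be made concrete with a toy model. The sketch below assumes a scalar model f_\theta(x) = \theta x, a single gradient step for training, and a hypothetical task y = 3x. It estimates the derivative of the post-training test error with a central finite difference, purely for illustration; it is not how the paper computes the gradient.

```python
def loss(theta, data):
    """Mean squared error of the scalar model f_theta(x) = theta * x."""
    return sum((theta * x - y) ** 2 for x, y in data) / len(data)


def grad(theta, data):
    """Analytic derivative of `loss` with respect to theta."""
    return sum(2.0 * x * (theta * x - y) for x, y in data) / len(data)


def post_adaptation_loss(theta, support, query, alpha=0.1):
    """Train on the K support samples with one gradient step, then test on new samples."""
    theta_adapted = theta - alpha * grad(theta, support)
    return loss(theta_adapted, query)


# Hypothetical task y = 3x, with K = 2 training samples and 2 held-out test samples.
support = [(1.0, 3.0), (2.0, 6.0)]
query = [(-1.0, -3.0), (0.5, 1.5)]

theta = 0.0
eps = 1e-5
# Central finite difference: how the test error changes with respect to theta.
meta_grad = (post_adaptation_loss(theta + eps, support, query)
             - post_adaptation_loss(theta - eps, support, query)) / (2 * eps)
print(post_adaptation_loss(theta, support, query), meta_grad)
```

A negative `meta_grad` here means that increasing \theta would lower the test error measured after training on the K samples, which is the signal the model is improved with.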

2.2. A Model-Agnostic Meta-Learning Algorithm

Intuition: Some internal representations are more transferrable than others. How can we encourage the emergence of such general-purpose representations?

•

A model f_\theta has parameters \theta.

•

For each task \mathcal{T}_i, the parameters \theta of f_\theta are adapted to \theta_i'.

•

Algorithm

cf) Terminology for the description below (provisionally defined by JH Gu)

◦

Divide tasks

Separate tasks into a meta-training task set (\{\mathcal{T}_i^{\text{tr}}\}) and a meta-test task set (\{\mathcal{T}_i^{\text{test}}\}).

(We can think of

\{\mathcal{T}_i^{\text{tr}}\}

as monthly mock exams, and

\{\mathcal{T}_i^{\text{test}}\}

as the annual college entrance exam (CSAT))

For each task, divide its samples into \mathcal{D}_{\mathcal{T}_i}^{\text{study}} (task-specific samples for studying, also called the support set) and \mathcal{D}_{\mathcal{T}_i}^{\text{check}} (task-specific samples for checking, also called the query set).

(We can think of

\mathcal{D}_{\mathcal{T}_i}^{\text{study}}

as the worked examples in the Korean math textbook 수학의 정석, and

\mathcal{D}_{\mathcal{T}_i}^{\text{check}}

as the practice exercises in the same textbook)

◦

Meta-training using the meta-training task set \{\mathcal{T}_i^{\text{tr}}\}

▪

Inner loop (task-specific K-shot learning)

For each \mathcal{T}_i in \{\mathcal{T}_i^{\text{tr}}\}, a new parameter vector \theta_i' is created.

Each \theta_i' is initialized as \theta.

With the task-specific samples for studying (\mathcal{D}_{\mathcal{T}_i^{\text{tr}}}^{\text{study}}), each \theta_i' is updated by:

\theta_i' = \theta - \alpha \nabla_\theta \mathcal{L}_{\mathcal{T}_i}(f_\theta)

▪

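Outer loop (meta-learning across tasks)

With the task-specific samples for checking (\mathcal{D}_{\mathcal{T}_i^{\text{tr}}}^{\text{check}}), \theta is updated by:

\theta \leftarrow \theta - \beta \nabla_\theta \sum_{\mathcal{T}_i \sim p(\mathcal{T})}\mathcal{L}_{\mathcal{T}_i} (f_{\theta_i'})

cf) Second-order derivative (Hessian) problem: since each \theta_i' is itself a function of \theta, the meta-gradient differentiates through the inner gradient step, which requires second derivatives of the loss (a Hessian-vector product).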

◦

Measure model performance using the meta-test task set \{\mathcal{T}_i^{\text{test}}\}

For each \mathcal{T}_i in \{\mathcal{T}_i^{\text{test}}\}, adjust the task-specific parameters with \mathcal{D}_{\mathcal{T}_i^{\text{test}}}^{\text{study}}.

Test the performance with \mathcal{D}_{\mathcal{T}_i^{\text{test}}}^{\text{check}}.
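The inner loop, outer loop and meta-test evaluation described above can be sketched on a toy problem. This is a minimal sketch and not the authors' implementation. It assumes a scalar linear model f_\theta(x) = \theta x, a hypothetical task family y = wx with slope w drawn from U(2, 4), hand-written gradients, and one inner gradient step. For this model the Hessian is a scalar, so the second-order term in the meta-gradient is explicit. The hypothetical `first_order` flag drops that term, in the spirit of the first-order approximation the paper also evaluates.

```python
import random


def loss(theta, data):
    """Mean squared error of f_theta(x) = theta * x."""
    return sum((theta * x - y) ** 2 for x, y in data) / len(data)


def grad(theta, data):
    """First derivative of `loss` with respect to theta."""
    return sum(2.0 * x * (theta * x - y) for x, y in data) / len(data)


def hessian(data):
    """Second derivative of `loss`; constant in theta for this linear model."""
    return sum(2.0 * x * x for x, _ in data) / len(data)


def sample_task_data(rng, k):
    """Hypothetical task family y = w * x; returns (study set, check set)."""
    w = rng.uniform(2.0, 4.0)

    def draw():
        xs = [rng.uniform(-1.0, 1.0) for _ in range(k)]
        return [(x, w * x) for x in xs]

    return draw(), draw()


def maml_train(steps=500, tasks_per_step=5, k=5, alpha=0.3, beta=0.05,
               seed=0, first_order=False):
    rng = random.Random(seed)
    theta = 0.0
    for _ in range(steps):
        meta_grad = 0.0
        for _ in range(tasks_per_step):
            study, check = sample_task_data(rng, k)
            # Inner loop: theta_i' = theta - alpha * dL/dtheta on the study set.
            theta_i = theta - alpha * grad(theta, study)
            # Chain rule: d(theta_i')/d(theta) = 1 - alpha * Hessian.
            jac = 1.0 if first_order else 1.0 - alpha * hessian(study)
            # Gradient of the check-set loss at theta_i', taken with respect to theta.
            meta_grad += grad(theta_i, check) * jac
        # Outer loop: update the shared initialization.
        theta -= beta * meta_grad
    return theta


def meta_test(theta, n_tasks=100, k=5, alpha=0.3, seed=1):
    """Average check-set loss on unseen tasks before and after one adaptation step."""
    rng = random.Random(seed)
    before = after = 0.0
    for _ in range(n_tasks):
        study, check = sample_task_data(rng, k)
        theta_i = theta - alpha * grad(theta, study)
        before += loss(theta, check) / n_tasks
        after += loss(theta_i, check) / n_tasks
    return before, after
```

Running `meta_test(maml_train())` compares the average check-set loss of the learned initialization before and after adaptation on tasks that were not seen during meta-training.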

3. Species of MAML

3.1. Supervised Regression and Classification

•

Algorithm

•

Formalizing supervised regression and classification

◦

Horizon H = 1

◦

Drop the timestep subscript on \mathbf{x}_t (since the model accepts a single input and produces a single output)

◦

The task \mathcal{T}_i generates K i.i.d. observations \mathbf{x} from q_i

◦

The task loss is the error between the model's output for \mathbf{x} and the corresponding target values \mathbf{y}.

•

Loss functions

◦

MSE for regression

\mathcal{L}_{\mathcal{T}_i}(f_\phi) = \sum_{\mathbf{x}^{(j)}, \mathbf{y}^{(j)} \sim \mathcal{T}_i} \| f_\phi(\mathbf{x}^{(j)}) - \mathbf{y}^{(j)}\|^2_2

◦

Cross entropy loss for discrete classification

\mathcal{L}_{\mathcal{T}_i}(f_\phi) = -\sum_{\mathbf{x}^{(j)}, \mathbf{y}^{(j)} \sim \mathcal{T}_i} \big\{ \mathbf{y}^{(j)} \log f_\phi(\mathbf{x}^{(j)}) + (1-\mathbf{y}^{(j)})\log(1-f_\phi(\mathbf{x}^{(j)}))\big\}
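The two task losses can be written as short functions. The sketch below assumes scalar model outputs and binary labels, and uses the usual convention that the cross-entropy loss is the negative log-likelihood, so that lower is better. The function names and the clamping constant `eps` are illustrative choices, not something taken from the paper.

```python
import math


def mse_loss(preds, targets):
    """Sum of squared errors over the samples of one task (scalar outputs)."""
    return sum((p - y) ** 2 for p, y in zip(preds, targets))


def cross_entropy_loss(probs, labels, eps=1e-12):
    """Binary cross-entropy summed over samples; probs are model outputs in (0, 1)."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return total
```

Both functions return a non-negative number that shrinks as the predictions approach the targets, so either one can be plugged in as \mathcal{L}_{\mathcal{T}_i} in the inner and outer updates.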

3.2. Reinforcement Learning

•

Algorithm

•

Goal of MAML in RL:

Quickly acquire a policy for a new test task using only a small amount of experience in the test setting.

•

Formalizing RL

Each RL task

\mathcal{T}_i

contains:

◦

Initial state distribution q_i(\mathbf{x}_1)

◦

Transition distribution q_i(\mathbf{x}_{t+1}|\mathbf{x}_t, \mathbf{a}_t)

▪

\mathbf{a}_t: action

◦

Loss \mathcal{L}_{\mathcal{T}_i}, which corresponds to the negative reward function R

Therefore, the entire task is a Markov decision process (MDP) with horizon

H

The model being learned,

f_\theta

, is a policy that maps from states

\mathbf{x}_t

to a distribution over actions

\mathbf{a}_t

at each timestep

t \in \{ 1, ..., H\}

•

Loss function for task \mathcal{T}_i and model f_\phi:

\mathcal{L}_{\mathcal{T}_i}(f_\phi) = -\mathbb{E}_{\mathbf{x}_t, \mathbf{a}_t \sim f_\phi, q_{\mathcal{T}_i}} \bigg [ \sum_{t=1}^H R_i(\mathbf{x}_t, \mathbf{a}_t) \bigg ]

•

Policy gradient method

Since the expected reward is generally not differentiable due to unknown dynamics, the authors use policy gradient methods to estimate the gradient.

Policy gradient methods are on-policy, so new samples must be drawn from the current policy for every gradient estimate.

→ There are additional sampling procedures in steps 5 and 8 of the algorithm.
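To make the RL loss and the two sampling steps concrete, the sketch below estimates \mathcal{L}_{\mathcal{T}_i} and its gradient with a plain REINFORCE estimator. Everything in it is a simplifying assumption: the policy is a state-independent Bernoulli policy over two actions with a single logit \theta, the hypothetical task rewards one of the two actions at every timestep, and the function names are made up for the example.

```python
import math
import random


def sample_trajectory(theta, rewarded_action, horizon, rng):
    """Roll out the Bernoulli policy for `horizon` steps; returns (actions, rewards)."""
    p = 1.0 / (1.0 + math.exp(-theta))  # probability of choosing action 1
    actions = [1 if rng.random() < p else 0 for _ in range(horizon)]
    rewards = [1.0 if a == rewarded_action else 0.0 for a in actions]
    return actions, rewards


def estimate_loss_and_grad(theta, trajectories):
    """Monte Carlo estimate of L = -E[sum_t R] and its REINFORCE gradient."""
    p = 1.0 / (1.0 + math.exp(-theta))
    loss = grad = 0.0
    for actions, rewards in trajectories:
        ret = sum(rewards)
        score = sum(a - p for a in actions)  # d/dtheta of sum_t log pi(a_t)
        loss -= ret
        grad -= score * ret
    n = len(trajectories)
    return loss / n, grad / n


def adapt_policy(theta, rewarded_action, k=20, horizon=5, alpha=0.5, seed=0):
    rng = random.Random(seed)
    # First sampling step: K trajectories from the current policy f_theta.
    pre = [sample_trajectory(theta, rewarded_action, horizon, rng) for _ in range(k)]
    _, grad = estimate_loss_and_grad(theta, pre)
    theta_adapted = theta - alpha * grad
    # Second sampling step: fresh trajectories from the adapted policy f_theta'.
    post = [sample_trajectory(theta_adapted, rewarded_action, horizon, rng)
            for _ in range(k)]
    post_loss, _ = estimate_loss_and_grad(theta_adapted, post)
    return theta_adapted, post_loss
```

Because the estimator is on-policy, the trajectories collected before adaptation cannot be reused to evaluate the adapted policy, which is why `adapt_policy` samples twice.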

4. Comparison with related works

Comparison with other popular approaches

•

Training a meta-learner that learns how to update the parameters of the learner's model

ex) On the optimization of a synaptic learning rule (Bengio et al., 1992)

→ Requires additional parameters, while MAML does not.

•

Training to compare new examples in a learned metric space

ex) Siamese networks (Koch, 2015), recurrence with attention mechanisms (Vinyals et al., 2016)

→ Difficult to extend directly to other problems, such as reinforcement learning.

•

Training memory-augmented models

ex) Meta-learning with memory-augmented neural networks (Santoro et al., 2016)

The recurrent learner is trained to adapt to new tasks as it is rolled out.

→ Less straightforward than MAML, which uses the same gradient descent update for both the learner and the meta-update.

5. Experimental Evaluation

Three questions

Can MAML enable fast learning of new tasks?

Can MAML be used for meta-learning in multiple different domains?

Can a model learned with MAML continue to improve with additional gradient updates and/or examples?

5.1. Regression

5.2. Classification

5.3. Reinforcement Learning

References

KAIST NeuroAI JC_#1 Meta Learning (edited version)

Moderator: Soyeon Kim, Presenter: Hyewon Jeong

https://youtu.be/Izqod36syY8

ai.stanford.edu

https://ai.stanford.edu/~cbfinn/_files/dissertation.pdf

[Week 10] (MAML) Model-agnostic Meta Learning for Fast Adaptation of Deep Networks paper review

The keywords of MAML are 'Model Agnostic' and 'Fast adaptation'. Model agnostic means applicable regardless of the model: MAML can be applied to any model that is trained with gradient descent. Fast adaptation means learning a new task quickly (with few updates), which is the core concept of meta-learning. MAML is based on the idea that if we learn a model that is broadly suitable for many tasks, fine-tuning it to a specific task will be fast.

https://velog.io/@tobigs_xai/10주차-MAML-Model-agnostic-Meta-Learning-for-Fast-Adaptation-of-Deep-Networks-논문-리뷰

Meta-Learning: Learning to Learn Fast

Meta-learning, also known as "learning to learn", intends to design models that can learn new skills or adapt to new environments rapidly with a few training examples. There are three common approaches: 1) learn an efficient distance metric (metric-based); 2) use (recurrent) network with external or internal memory (model-based); 3)...

https://lilianweng.github.io/lil-log/2018/11/30/meta-learning.html