
Few-shot Learning with Graph Neural Networks

Summary

For few-shot learning, the model also uses similarity information between samples (i.e., it does not stop at learning each sample's label independently).
Each sample is viewed as a node of a graph, and each edge is interpreted as a similarity kernel between two samples.
The edges, i.e., the similarity kernels, are trainable (i.e., not pre-defined as a simple inner product, etc.).
Inspired by message passing algorithms, node features are updated at each time step with messages received from neighboring nodes.
Also applicable to semi-supervised learning and, further, to active learning.
Shows state-of-the-art performance on Omniglot and Mini-ImageNet with fewer parameters (as of 2017).

Keywords

Few shot learning
Graph neural network
Semi-supervised learning
Active learning with Attention

1. Introduction

Supervised end-to-end learning has been extremely successful in computer vision, speech, and machine translation tasks.
However, some tasks (e.g., few-shot learning) cannot achieve high performance with conventional methods.
New supervised learning setup
Input-output setup:
With i.i.d. samples of collections of images and their associated label similarity
cf) conventional setup: i.i.d. samples of images and their associated labels
Authors' model can be extended to semi-supervised and active learning
Semi-supervised learning:
Learning from a mixture of labeled and unlabeled examples
Active learning:
The learner has the option to request those missing labels that will be most helpful for the prediction task
ICML 2019 active learning tutorial
Annotated by JH Gu

2. Closely related works and ideas

[Research article] Matching Networks for One Shot Learning - Vinyals et al. (2016)
[Research article] Prototypical Networks for Few-shot Learning - Snell et al. (2017)
[Review article] Geometric Deep Learning - Bronstein et al. (2017)
[Research article] Message passing - Gilmer et al. (2017)

3. Problem set-up

Authors view the task as a supervised interpolation problem on a graph
Nodes: Images
Edges: Similarity kernels → TRAINABLE

General set-up

Input-output pairs $(\mathcal{T}_i, Y_i)_i$ drawn i.i.d. from a distribution $\mathcal{P}$ of partially labeled image collections
$s$: # labeled samples
$r$: # unlabeled samples
$t$: # samples to classify
$K$: # classes
$\mathcal{P}_l(\mathbb{R}^N)$: class-specific image distribution over $\mathbb{R}^N$
Targets $Y_i$ are associated with $\bar{x}_1, ..., \bar{x}_t \in \mathcal{T}_i$
Learning objective:
$\min_\Theta \frac{1}{L} \sum_{i \leq L} \ell(\Phi(\mathcal{T}_i; \Theta), Y_i) + \mathcal{R}(\Theta)$
($\mathcal{R}$ is a standard regularization term)
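For concreteness, here is a minimal sketch of how one such partially labeled collection $\mathcal{T}_i$ could be sampled; the function `sample_episode`, the `image_pool` structure, and the class-balancing logic are illustrative assumptions, not the authors' code.

```python
import random

def sample_episode(image_pool, K=5, q=1, r=0):
    """Sample one partially labeled collection T_i (illustrative sketch, not the authors' code).
    s = q*K labeled images, r unlabeled images, t = 1 image to classify.
    image_pool: dict mapping class label -> list of images (assumed structure)."""
    classes = random.sample(sorted(image_pool), K)
    target_class = random.choice(classes)
    labeled, unlabeled, query = [], [], None
    for c in classes:
        extra = 1 if c == target_class else 0
        imgs = random.sample(image_pool[c], q + extra)
        labeled += [(x, c) for x in imgs[:q]]             # q labeled examples per class
        if c == target_class:
            query = imgs[-1]                              # held-out query image (t = 1)
    for _ in range(r):                                    # r unlabeled images from the same classes
        unlabeled.append(random.choice(image_pool[random.choice(classes)]))
    return labeled, unlabeled, query, target_class
```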

Few shot learning setting

$r=0,\ t=1,\ s=qK \longrightarrow q\text{-shot},\ K\text{-way}$

Semi-supervised learning setting

$r > 0,\ t=1$
The model can use the auxiliary (unlabeled) images $\{\tilde{x}_1, ..., \tilde{x}_r\}$ to improve prediction accuracy, by leveraging the fact that these samples are drawn from the common class distributions.

Active learning setting

The learner has the ability to request labels from the auxiliary images $\{\tilde{x}_1, ..., \tilde{x}_r\}$.

4. Model

$\phi(x)$: CNN
$h(l)$: one-hot encoded label (for the labeled set), or uniform distribution (for the unlabeled set)

4.1. Set and Graph Input Representations

The goal of few shot learning:
To propagate label information from labeled samples towards the unlabeled query image
→ The propagation can be formalized as a posterior inference over a graphical model
$G_\mathcal{T} = (V, E)$
Similarity measure is not pre-specified, but learned!
cf.) In a Siamese network, the similarity measure is fixed (L1 distance)!
(Note: the corresponding sentence in this paper (Few-shot Learning with GNN) has an odd structure and is written confusingly.)

4.2. Graph Neural Networks

We are given an input signal $F \in \mathbb{R}^{V \times d}$ on the vertices of a weighted graph $G$.
Then we consider a family (a set) $\mathcal{A}$ of graph intrinsic linear operators.
$\mathcal{A} = \{\tilde{A}^{(k)}, \mathbf{1}\}$
Linear operator
e.g.) The simplest linear operator is the adjacency operator $A$, where $(AF)_i = \sum_{j \sim i} w_{i,j} F_j$ ($w_{i,j}$ is the associated edge weight)
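A tiny numeric example of the adjacency operator acting on a node feature matrix (the values are made up purely for illustration):

```python
import torch

# 3 nodes, d = 2 features per node (made-up values)
F = torch.tensor([[1., 0.],
                  [0., 1.],
                  [1., 1.]])
# weighted adjacency matrix: w_ij is the edge weight between nodes i and j
A = torch.tensor([[0.0, 0.5, 0.5],
                  [0.5, 0.0, 0.5],
                  [0.5, 0.5, 0.0]])
AF = A @ F   # (AF)_i = sum_j w_ij * F_j: each node aggregates its neighbors' features
print(AF)    # tensor([[0.5, 1.0], [1.0, 0.5], [0.5, 0.5]])
```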

GNN layer

A GNN layer $\text{Gc}(\cdot)$ receives as input a signal $\mathbf{x}^{(k)} \in \mathbb{R}^{V\times d_k}$ and produces $\mathbf{x}^{(k+1)} \in \mathbb{R}^{V\times d_{k+1}}$
$\mathbf{x}^{(k+1)} = \text{Gc}(\mathbf{x}^{(k)}) = \rho\Big(\sum_{B\in\mathcal{A}} B\mathbf{x}^{(k)}\theta^{(k)}_{B,l}\Big)$
$\mathbf{x}^{(k)}$: representation vectors of the nodes at time step $k$
$\theta$: trainable parameters
$\rho$: Leaky ReLU
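A minimal PyTorch sketch of one such layer; the two-operator family, the dimensions, and the Leaky ReLU slope are assumptions for illustration, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class GraphConv(nn.Module):
    """One GNN layer: x^(k+1) = LeakyReLU( sum over B in A of  B @ x^(k) @ theta_B ).  (sketch)"""
    def __init__(self, d_in, d_out, n_operators=2):
        super().__init__()
        # one trainable linear map theta_B per operator in the family A
        self.linears = nn.ModuleList([nn.Linear(d_in, d_out, bias=False)
                                      for _ in range(n_operators)])
        self.rho = nn.LeakyReLU(0.2)   # slope 0.2 is an assumption

    def forward(self, x, operators):
        # x: (V, d_in) node features; operators: list of (V, V) matrices, e.g. [A_tilde, I]
        out = sum(lin(B @ x) for B, lin in zip(operators, self.linears))
        return self.rho(out)

# usage sketch, with the family A = {A_tilde, identity}:
# layer = GraphConv(d_in=64, d_out=48)
# x_next = layer(x, [A_tilde, torch.eye(x.shape[0])])
```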

Construction of edge feature matrix, inspired by message passing algorithm

$\tilde{A}^{(k)}_{i,j} = \varphi_{\tilde{\theta}}(\mathbf{x}^{(k)}_i, \mathbf{x}^{(k)}_j)$
$\tilde{A}^{(k)}_{i,j}$: learned edge features from the nodes' current hidden representations (at time step $k$)
$\varphi$: a metric, a symmetric function parameterized by a neural network
$\varphi_{\tilde{\theta}}(\mathbf{x}^{(k)}_i, \mathbf{x}^{(k)}_j) = \text{MLP}_{\tilde{\theta}}(\text{abs}(\mathbf{x}^{(k)}_i - \mathbf{x}^{(k)}_j))$
$\tilde{A}^{(k)}$ is then normalized by a row-wise softmax
→ And added to the family $\mathcal{A} = \{\tilde{A}^{(k)}, \mathbf{1}\}$
$\mathbf{1}$: identity matrix, which provides a self-edge to aggregate the vertex's own features
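A sketch of this trainable similarity kernel; the hidden width and MLP depth are assumptions, and the paper's actual MLP may differ.

```python
import torch
import torch.nn as nn

class EdgeMLP(nn.Module):
    """Trainable similarity kernel: A_tilde[i, j] = MLP(|x_i - x_j|), row-normalized by softmax. (sketch)"""
    def __init__(self, d, hidden=64):   # hidden width is an assumption
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(d, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, x):
        # x: (V, d) current node representations x^(k)
        diff = (x.unsqueeze(0) - x.unsqueeze(1)).abs()   # (V, V, d) pairwise abs(x_i - x_j)
        logits = self.mlp(diff).squeeze(-1)              # (V, V) learned edge scores
        return torch.softmax(logits, dim=-1)             # row-wise softmax -> A_tilde^(k)
```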

Construction of initial node features

$\mathbf{x}^{(0)}_i = (\phi(x_i), h(l_i))$
$\phi$: convolutional neural network
$h(l) \in \mathbb{R}^K_+$: a one-hot encoding of the label
For images with unknown labels, $\tilde{x}_j$ (unlabeled data) and $\bar{x}_j$ (test data), $h(l_j)$ is set to the uniform distribution.
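A small sketch of this initial node feature construction, assuming the CNN embeddings $\phi(x_i)$ are already computed and using `None` as an illustrative marker for unknown labels:

```python
import torch

def initial_node_features(embeddings, labels, K):
    """x_i^(0) = (phi(x_i), h(l_i)). (illustrative sketch)
    embeddings: (V, d) CNN outputs phi(x_i); labels: list of class indices,
    with None marking unlabeled / query images (assumed convention)."""
    label_field = torch.full((len(labels), K), 1.0 / K)   # uniform distribution for unknown labels
    for i, l in enumerate(labels):
        if l is not None:
            label_field[i] = torch.zeros(K)
            label_field[i, l] = 1.0                        # one-hot h(l_i) for labeled samples
    return torch.cat([embeddings, label_field], dim=-1)    # (V, d + K)
```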

5. Training

5.1. Few-shot and Semi-supervised learning

The final layer of the GNN is a softmax mapping. We then use the cross-entropy loss:
$\ell(\Phi(\mathcal{T}; \Theta), Y) = -\sum_k y_k \log P(Y_* = y_k \mid \mathcal{T})$
The semi-supervised setting is trained identically, but the initial label fields of the $\tilde{x}_j$ are filled with the uniform distribution.
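In PyTorch-style pseudocode this is a standard cross-entropy on the query node's output (placeholder values; `F.cross_entropy` applies the softmax internally, which is equivalent to a softmax output followed by cross-entropy):

```python
import torch
import torch.nn.functional as F

# logits_query: (t, K) GNN outputs at the query node(s); y_true: (t,) class indices
logits_query = torch.randn(1, 5)              # placeholder values, 5-way task
y_true = torch.tensor([2])
loss = F.cross_entropy(logits_query, y_true)  # = -sum_k y_k log P(Y* = y_k | T)
```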

5.2. Active learning (with attention)

In active learning, the model has the intrinsic ability to query for one of the labels from $\{\tilde{x}_1, ..., \tilde{x}_r\}$.
The network will learn to ask for the most informative label to classify the sample xˉ\bar{x}.
The querying is done after the first layer of GNN by using a softmax attention over the unlabeled nodes of the graph.

Attention

We apply a function $g(\mathbf{x}^{(1)}_i) \in \mathbb{R}^1$ that maps each unlabeled node's vector to a scalar value.
A softmax is applied over the $\{1, ..., r\}$ scalar values obtained after applying $g$:
$r$: # unlabeled samples
$\text{Attention} = \text{Softmax}(g(\mathbf{x}^{(1)}_{\{1,...,r\}}))$
To query only one sample, we set all elements to zero except for one → $\text{Attention}'$
At training time, the model randomly samples one value based on its multinomial probability.
At test time, the model just keeps the maximum value.
Then we multiply this with the label vectors:
$w \cdot h(l_{i^*}) = \langle \text{Attention}', h(l_{\{1,...,r\}}) \rangle$
($w$ is a scaling factor)
This value is then summed into the current representation.
$\mathbf{x}^{(1)}_{i^*} = [\text{Gc}(\mathbf{x}^{(0)}_{i^*}), \mathbf{x}^{(0)}_{i^*}] = [\text{Gc}(\mathbf{x}^{(0)}_{i^*}), (\phi(x_{i^*}), h(l_{i^*}))]$
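A hedged sketch of this attention-based query step; the function names, the `(r, K)` label tensor, and the way the queried label is returned are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def query_label(x_unlabeled, labels_unlabeled, g, training=True):
    """Softmax-attention label query over the r unlabeled nodes. (illustrative sketch)
    x_unlabeled: (r, d) representations after the first GNN layer;
    labels_unlabeled: (r, K) one-hot labels, revealed only for the queried node;
    g: module mapping a node vector to a scalar score."""
    scores = g(x_unlabeled).squeeze(-1)         # (r,) one scalar per unlabeled node
    attn = torch.softmax(scores, dim=0)         # Attention over {1, ..., r}
    if training:
        idx = torch.multinomial(attn, 1)        # sample one node from the multinomial
    else:
        idx = attn.argmax(dim=0, keepdim=True)  # at test time keep the maximum
    attn_prime = torch.zeros_like(attn).scatter_(0, idx, attn[idx])  # Attention': one nonzero entry
    # scaled label vector of the queried node, to be summed into its representation
    return attn_prime @ labels_unlabeled        # (K,) = <Attention', h(l_{1..r})>
```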

6. Results

6.1. Few-shot learning

Omniglot

# of parameters: ~5M (TCML) vs. ~300K (3-layer GNN)
Omniglot:
1,623 characters × 20 examples per character

Mini-ImageNet

# of parameters: ~11M (TCML) vs. ~400K (3-layer GNN)
Mini-ImageNet:
Originally introduced by Vinyals et al. (2016)
Divided into 64 training, 16 validation, and 20 testing classes, each containing 600 examples.

6.2. Semi-supervised learning

Omniglot

Mini-ImageNet

6.3. Active learning

Random: Network chooses a random sample to be labeled, instead of one that maximally reduces the loss of the classification task $\mathcal{T}$