Qianru Sun, Yaoyao Liu, Tat-Seng Chua, Bernt Schiele [NUS, Max Planck] [CVPR 2019]
Abstract
- Meta-learning – proposed to address the challenges of the few-shot learning setting
- Key idea – leverage a large number of similar few-shot tasks to learn how to adapt a base-learner to a new task with few labels.
- DNNs overfit with few samples, so meta-learning typically uses shallow neural networks (SNNs), which limits its effectiveness
- Contribution: Meta-Transfer Learning (MTL) – learns to adapt a DNN for few-shot learning.
- Meta – refers to training on multiple tasks
- Transfer – achieved by learning scaling and shifting functions of DNN weights for each task
- hard-task (HT) meta-batch: an effective learning curriculum for MTL.
- Benchmark: miniImageNet and Fewshot-CIFAR100 (5-class 1-shot and 5-class 5-shot)
Introduction
- Few-shot learning – learn new concepts from few labeled examples; e.g. on CIFAR-100, the state of the art achieves only 40.1% accuracy for 1-shot learning
- Few-shot learning methods fall into 2 categories:
- Data augmentation: a data generator (e.g. conditioned on Gaussian noise), but such methods underperform in the 1-shot setting.
- Task-based meta-learning: meta-learning aims to accumulate learning experience from multiple tasks, while base-learning focuses on modeling the data distribution of a single task.
- Model-Agnostic Meta-Learning (MAML) – learns to search for the optimal initialization state to fast adapt a base-learner to a new task. Limitations: it requires a large number of similar tasks -> costly, and the base-learner is a shallow NN to avoid overfitting, so it cannot leverage DNNs
- MTL – a novel learning method that converges faster with a lower probability of overfitting.
- Transfer – weight transfer via 2 lightweight neuron operations: scaling and shifting, αX + β.
- 2nd contribution: an effective meta-training curriculum. Curriculum learning and hard negative mining -> faster convergence and stronger performance. Inspired by this, they design the hard task (HT) meta-batch strategy, which online re-samples harder tasks according to past failure tasks with the lowest validation accuracy.
Related Work
- Few-shot learning
- Metric learning method: learn a similarity space in which learning is efficient
- Memory network methods: learn to store experience when learning seen tasks and generalize it to unseen tasks.
- Gradient-descent-based methods: have a specific meta-learner that learns to adapt a base-learner across different tasks (e.g. MAML) – this paper belongs to this category
- Transfer learning
- Fine-tuning
- Taking pre-trained networks as a backbone and adding high-level functions (e.g. object detection, recognition, and image segmentation)
- Curriculum learning & Hard sample mining
- Curriculum learning: instead of randomly sampling observations, organize them in a meaningful order -> faster convergence, effective learning, better generalization
- Hard sample mining: in object detection, it treats image proposals overlapped with ground-truth as hard negative samples. Training on more confusing data enables the model to achieve higher robustness & better performance
Preliminary
- Meta-learning: 2 phases on a classification task; each task T (episode) is sampled from a distribution p(T)
- Meta-train: aims to learn from a number of episodes {T}
- Meta-test
- Meta-training phase: learn from multiple episodes, with a 2-stage optimization in each episode (see the episode-loop sketch below)
- Stage 1, base-learning: cross-entropy loss to optimize the parameters of the base-learner
- Stage 2, feed-forward test on the episode's test data points: the test loss is used to optimize the parameters of the meta-learner.
- Meta-test phase: tests the fast adaptation to unseen tasks.
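A minimal sketch of the two-stage optimization inside one meta-training episode, assuming PyTorch, a linear base-learner on pre-extracted features, synthetic episode data, and a first-order approximation for the meta-update; these are illustrative assumptions, not the paper's exact setup.

```python
# Two-stage episode optimization: stage 1 adapts the base-learner on T(tr),
# stage 2 updates the meta-parameters from the loss on T(te).
import torch
import torch.nn.functional as F

feat_dim, n_way, k_shot, k_query = 32, 5, 1, 15   # made-up sizes for illustration
base_lr, meta_lr, inner_steps = 0.01, 0.001, 20

# The meta-learner owns the initialization of the base-learner (classifier).
meta_w = torch.zeros(n_way, feat_dim, requires_grad=True)
meta_b = torch.zeros(n_way, requires_grad=True)
meta_opt = torch.optim.Adam([meta_w, meta_b], lr=meta_lr)

def sample_episode():
    """Synthetic stand-in for sampling T ~ p(T): returns (x_tr, y_tr, x_te, y_te)."""
    x_tr = torch.randn(n_way * k_shot, feat_dim)
    y_tr = torch.arange(n_way).repeat_interleave(k_shot)
    x_te = torch.randn(n_way * k_query, feat_dim)
    y_te = torch.arange(n_way).repeat_interleave(k_query)
    return x_tr, y_tr, x_te, y_te

for episode in range(100):
    x_tr, y_tr, x_te, y_te = sample_episode()

    # Stage 1 (base-learning): start from the meta-learned init and take a few
    # gradient-descent steps on the episode's training split with cross-entropy.
    w = meta_w.detach().clone().requires_grad_(True)
    b = meta_b.detach().clone().requires_grad_(True)
    for _ in range(inner_steps):
        loss_tr = F.cross_entropy(x_tr @ w.t() + b, y_tr)
        g_w, g_b = torch.autograd.grad(loss_tr, (w, b))
        w = (w - base_lr * g_w).detach().requires_grad_(True)
        b = (b - base_lr * g_b).detach().requires_grad_(True)

    # Stage 2 (meta-update): evaluate the adapted base-learner on the episode's
    # test split and update the meta-parameters with the test-loss gradient
    # (first-order shortcut; the paper back-propagates the test loss instead).
    loss_te = F.cross_entropy(x_te @ w.t() + b, y_te)
    g_w, g_b = torch.autograd.grad(loss_te, (w, b))
    meta_opt.zero_grad()
    meta_w.grad = g_w
    meta_b.grad = g_b
    meta_opt.step()
```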
Methodology – 3 phases
- DNN training on large-scale data
- E.g. on miniImageNet (64 classes, 600 samples per class); the low-level layers are then fixed as the feature extractor.
- First, randomly initialize a feature extractor (the conv layers of a ResNet) and a classifier (the last FC layer of the ResNet), then optimize both by gradient descent.
- The feature extractor is then frozen, and the learned classifier is discarded because few-shot tasks are 5-class instead of 64-class (see the pre-training sketch below).
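A minimal sketch of phase 1 (DNN pre-training on large-scale data), assuming PyTorch and a placeholder `train_loader` over the 64 meta-train classes; the architecture stand-in and hyper-parameters are illustrative, not the authors' code.

```python
# Phase 1: jointly train the feature extractor and a 64-way classifier, then
# freeze the extractor and discard the classifier.
import torch
import torch.nn as nn

feature_extractor = nn.Sequential(            # stand-in for the ResNet conv layers
    nn.Conv2d(3, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)
classifier_64way = nn.Linear(64, 64)          # last FC layer for the 64 pre-training classes
opt = torch.optim.SGD(
    list(feature_extractor.parameters()) + list(classifier_64way.parameters()),
    lr=0.1, momentum=0.9,
)
criterion = nn.CrossEntropyLoss()

def pretrain(train_loader, epochs=1):
    for _ in range(epochs):
        for images, labels in train_loader:   # 64-class, many-shot batches
            logits = classifier_64way(feature_extractor(images))
            loss = criterion(logits, labels)
            opt.zero_grad()
            loss.backward()
            opt.step()

def freeze_extractor():
    # After pre-training: freeze the extractor (used as-is from now on);
    # the 64-way classifier is simply not reused, since few-shot tasks are 5-way.
    for p in feature_extractor.parameters():
        p.requires_grad = False
    feature_extractor.eval()
```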
- Meta-transfer learning (MTL)
- Learns scaling and shifting (SS) parameters for the feature extractor neurons, enabling fast adaptation to few-shot tasks (a sketch of the SS operation follows this block)
- SS parameters are trained through HT meta-batch training
- The loss of T(tr) is used to optimize the base-learner (classifier) by GD, without updating the frozen conv layers; the classifier also differs from the previous phase (5-class instead of 64-class)
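A minimal sketch of the SS operation on one frozen conv layer, assuming PyTorch; the module and variable names (`SSConv2d`, scale/shift standing in for the SS parameters) are illustrative, not from the released code.

```python
# Scaling-and-Shifting: frozen weights W are scaled channel-wise and the bias is
# shifted; only the scale/shift parameters and the small 5-way classifier train.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SSConv2d(nn.Module):
    def __init__(self, frozen_conv: nn.Conv2d):
        super().__init__()
        self.conv = frozen_conv
        for p in self.conv.parameters():          # keep the pre-trained weights frozen
            p.requires_grad = False
        out_ch = frozen_conv.out_channels
        self.scale = nn.Parameter(torch.ones(out_ch, 1, 1, 1))   # init. to ones
        self.shift = nn.Parameter(torch.zeros(out_ch))           # init. to zeros

    def forward(self, x):
        w = self.conv.weight * self.scale          # scaled frozen weights
        b = (self.conv.bias if self.conv.bias is not None
             else torch.zeros_like(self.shift)) + self.shift     # shifted bias
        return F.conv2d(x, w, b, stride=self.conv.stride,
                        padding=self.conv.padding, dilation=self.conv.dilation,
                        groups=self.conv.groups)

# Per task T: the 5-way base-learner (classifier) is updated by GD on the loss of
# T(tr); the SS parameters are updated by the loss of T(te).
ss_layer = SSConv2d(nn.Conv2d(3, 64, 3, padding=1))
base_learner = nn.Linear(64, 5)                    # re-initialized per task, 5-way
meta_opt = torch.optim.Adam([ss_layer.scale, ss_layer.shift], lr=1e-3)
base_opt = torch.optim.SGD(base_learner.parameters(), lr=1e-2)
```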
- Hard task (HT) meta-batch
- Intentionally pick up failure cases in each task and recompose their data into harder tasks for adverse re-training – "grow up through hardness"
- Pipeline: base-learner optimized by the loss of T(tr) -> SS parameters optimized by the loss of T(te) once -> get the recognition accuracy of T(te) for the M classes -> choose the lowest accuracy Acc_m to determine the most difficult class-m
- Choosing hard class-m: by ranking, not by a threshold.
- Two methods of re-sampling hard tasks using m: given the chosen {m}, re-sample tasks T_hard by
- Directly using samples of class-m in current task
- Indirectly using the label of class-m to sample new samples of that class
- Algorithms: Algorithm 1 covers the training of the large-scale DNN, meta-transfer learning, and HT meta-batch re-sampling; the failure classes are returned by Algorithm 2 (the learning process on a single task). A sketch of the hard-class selection and re-sampling follows.
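A minimal sketch of the HT meta-batch idea (lowest-accuracy class selection plus hard-task re-sampling), assuming NumPy; the function names, the shape of `task_results`, and the number of hard tasks are assumptions for illustration, not the authors' implementation.

```python
# Select the failure class m of each finished task, then re-sample harder tasks
# anchored on the collected {m}.
import numpy as np

def hardest_class(pred_labels, true_labels):
    """Return the class m with the lowest recognition accuracy Acc_m on T(te)."""
    pred_labels, true_labels = np.asarray(pred_labels), np.asarray(true_labels)
    classes = np.unique(true_labels)
    accs = [(pred_labels[true_labels == c] == c).mean() for c in classes]
    return classes[int(np.argmin(accs))]           # ranking, not a fixed threshold

def ht_meta_batch(task_results, all_classes, n_way=5, n_hard_tasks=2, rng=None):
    """task_results: list of (pred_labels, true_labels) pairs, one per finished task."""
    rng = rng or np.random.default_rng()
    failure_classes = [hardest_class(p, t) for p, t in task_results]
    hard_tasks = []
    for _ in range(n_hard_tasks):
        m = rng.choice(failure_classes)            # anchor the new task on a failure class
        others = rng.choice([c for c in all_classes if c != m],
                            size=n_way - 1, replace=False)
        # Option (1) reuses the current samples of class m; option (2) draws new
        # samples with label m — both only change how samples of m are filled in.
        hard_tasks.append([m, *others])
    return hard_tasks
```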
Experiments
- Datasets and Implementation details
- miniImageNet: for few-shot learning evaluation (see the episode-sampling sketch after this list).
- Fewshot-CIFAR100 (FC100)
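A minimal sketch of sampling a 5-class 1-shot (or 5-shot) evaluation episode, assuming the data pool is a dict mapping class id to a list of samples; this layout and the name `sample_episode` are assumptions, not the released data loaders.

```python
# Draw an N-way K-shot episode: K support samples and a query split per class.
import random

def sample_episode(pool, n_way=5, k_shot=1, k_query=15, rng=random):
    classes = rng.sample(sorted(pool), n_way)
    support, query = [], []
    for label, c in enumerate(classes):
        items = rng.sample(pool[c], k_shot + k_query)
        support += [(x, label) for x in items[:k_shot]]     # T(tr): K shots per class
        query   += [(x, label) for x in items[k_shot:]]     # T(te): query samples
    return support, query
```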
- Network architecture
- Feature extractor: 2 options
- 4CONV: 4 layers with 3×3 convolutions and 32 filters -> BN -> ReLU -> 2×2 max-pooling.
- ResNet-12: 4 residual blocks, each with 3 conv layers using 3×3 kernels; a 2×2 max-pooling layer at the end of each block; the number of filters starts at 64 and doubles in each subsequent block (a minimal sketch of the 4CONV option follows this list)
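A minimal sketch of the 4CONV feature extractor option in PyTorch; the padding and the flatten head are assumptions not restated in these notes.

```python
# 4 blocks of 3x3 conv with 32 filters, each followed by BatchNorm, ReLU,
# and 2x2 max-pooling.
import torch.nn as nn

def conv_block(in_ch, out_ch=32):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(),
        nn.MaxPool2d(2),
    )

four_conv = nn.Sequential(
    conv_block(3), conv_block(32), conv_block(32), conv_block(32),
    nn.Flatten(),                 # flattened features feed the few-shot classifier
)
```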
Conclusion
- Achieves top performance in tackling the few-shot learning problem