Meta-Transfer Learning for Few-Shot Learning


Qianru Sun, Yaoyao Liu, Tat-Seng Chua, Bernt Schiele [NUS, Max Planck Institute for Informatics] [CVPR 2019]

Abstract

  • Meta-learning – proposed to address the challenges of the few-shot learning setting
  • Key idea – leverage a large number of similar few-shot tasks to learn how to adapt a base-learner to a new task with few labels.
  • DNNs overfit with few samples, so existing meta-learning methods resort to shallow neural networks (SNNs), which limits their effectiveness
  • Contribution: Meta-Transfer Learning (MTL) – learns to adapt a DNN for few-shot learning.
  • Meta – training on multiple tasks
  • Transfer – achieved by learning scaling and shifting functions of DNN weights for each task
  • hard-task (HT) meta-batch: effective learning curriculum for MTL.
  • Benchmark: miniImageNet and Fewshot-CIFAR100 (5-class 1-shot and 5-class 5-shot)

Introduction

  • Few-shot learning – learning new concepts from few labeled examples. Still hard: the state-of-the-art 1-shot accuracy on CIFAR-100 is only 40.1%
  • Few-shot learning methods fall into 2 categories.
    • Data augmentation: train a data generator (e.g. conditioned on Gaussian noise) to synthesize more samples – underperforms in the 1-shot setting.
    • Task-based meta-learning: meta-learning accumulates experience from multiple tasks, while base-learning focuses on modeling the data distribution of a single task.
  • Model-Agnostic Meta-Learning (MAML) – learns to search for the optimal initialization state to fast adapt a base-learner to a new task. Limitations: it requires a large number of similar tasks -> costly, and the base-learner is a shallow NN to avoid overfitting, so it cannot exploit a DNN
  • MTL – a novel learning method that converges faster and is less likely to overfit.
  • Transfer – weight transfer with 2 lightweight neuron operations, scaling and shifting (αX + β); see the sketch after this list.
  • 2nd contribution: an effective meta-training curriculum. Curriculum learning and hard negative mining -> faster convergence and stronger performance. Inspired by this, they design the hard task (HT) meta-batch strategy: it online re-samples harder tasks according to past failure tasks with the lowest validation accuracy.
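
A minimal PyTorch sketch of the scaling-and-shifting idea referenced above. The `ScaleShiftConv2d` wrapper and the per-channel parameterization are assumptions for illustration, not the authors' implementation: the pre-trained conv weights stay frozen and only the lightweight scale/shift parameters are meta-learned.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaleShiftConv2d(nn.Module):
    """Wrap a frozen conv layer with learnable Scaling/Shifting (SS) parameters.

    The pre-trained weight W and bias b stay frozen; only the per-channel
    scale (init 1) and shift (init 0) are meta-learned, so the effective
    operation is roughly (W * scale) conv X + (b + shift)."""

    def __init__(self, frozen_conv: nn.Conv2d):
        super().__init__()
        self.conv = frozen_conv
        for p in self.conv.parameters():          # freeze pre-trained weights
            p.requires_grad = False
        out_ch = self.conv.out_channels
        self.scale = nn.Parameter(torch.ones(out_ch, 1, 1, 1))   # scaling
        self.shift = nn.Parameter(torch.zeros(out_ch))           # shifting

    def forward(self, x):
        w = self.conv.weight * self.scale          # scale the frozen weights
        b = (self.conv.bias if self.conv.bias is not None
             else torch.zeros_like(self.shift)) + self.shift     # shift the bias
        return F.conv2d(x, w, b, stride=self.conv.stride,
                        padding=self.conv.padding)
```

Because only one scale per output channel and one shift per bias entry are learned, the meta-parameters are a tiny fraction of the frozen weights, which is what makes adapting a deep network feasible with few samples.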

Related Work 

  • Few-shot learning
    • Metric learning method: learn a similarity space in which learning is efficient
    • Memory network method: learn to store experience when learning seen tasks and generalize it to unseen tasks.
    • Gradient descent based methods: have a specific meta-learner that learns to adapt a base-learner across different tasks (e.g. MAML) – this paper falls in the same family
  • Transfer learning
    • Fine-tuning
    • Taking pre-trained networks as backbones and adding high-level functions (e.g. object detection, recognition and image segmentation)
  • Curriculum learning & Hard sample mining
    • Curriculum learning: Instead of random sample observations, organize it in meaningful ways -> fast convergence, effective learning, better generalization
    • Hard sample mining: in object detection, it treats image proposals overlapped with ground-truth as hard negative samples. Training on more confusing data enables the model to achieve higher robustness & better performance  

Preliminary

  • Meta-learning: 2 phases on classification tasks; each task T (an episode) is sampled from a distribution p(T)
    • Meta-train:- aims to learn from a number of episodes {T}
    • Meta-test
  • Meta-training phase: learn from multiple episodes, with a 2-stage optimization in each episode (see the sketch after this list)
    • Stage 1, base-learning: cross-entropy loss on the episode training data to optimize the parameters of the base-learner
    • Stage 2, feed-forward test on the episode test data-points: the test loss is used to optimize the parameters of the meta-learner.
  • Meta-test phase: test the fast adaptation to unseen tasks.
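
A rough sketch of the 2-stage optimization described above, written in a generic MAML-style form. `sample_episode`, `classify`, and the optimizer wiring are hypothetical placeholders under assumptions, not the paper's exact algorithm.

```python
import torch
import torch.nn.functional as F

def classify(features, theta):
    """Linear base-learner: theta = [W, b]."""
    w, b = theta
    return features @ w.t() + b

def meta_train_step(feature_extractor, base_learner_init, meta_optimizer,
                    sample_episode, inner_lr=0.01, inner_steps=5):
    """One meta-training step: Stage 1 adapts the base-learner on the episode's
    train split; Stage 2 updates the meta-parameters from the test-split loss."""
    (x_tr, y_tr), (x_te, y_te) = sample_episode()        # one episode T

    # Stage 1 (base-learning): adapt a copy of the classifier with
    # cross-entropy on the episode's training data.
    theta = [p.clone() for p in base_learner_init]
    for _ in range(inner_steps):
        loss_tr = F.cross_entropy(classify(feature_extractor(x_tr), theta), y_tr)
        grads = torch.autograd.grad(loss_tr, theta, create_graph=True)
        theta = [p - inner_lr * g for p, g in zip(theta, grads)]

    # Stage 2 (meta-update): the test loss of the adapted base-learner
    # optimizes whatever meta-parameters the meta_optimizer holds
    # (e.g. the SS parameters inside feature_extractor).
    loss_te = F.cross_entropy(classify(feature_extractor(x_te), theta), y_te)
    meta_optimizer.zero_grad()
    loss_te.backward()
    meta_optimizer.step()
    return loss_te.item()
```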

Methodology – 3 phases

  • DNN training on large-scale data
    • E.g. on miniImageNet (64 classes, 600 samples per class); the low-level layers are then fixed as the feature extractor.
    • First randomly initialize a feature extractor (the conv layers of a ResNet) and a classifier (the last FC layer of the ResNet), and then optimize them by gradient descent
    • The feature extractor is then frozen and the learned classifier is discarded, because few-shot tasks have 5 classes instead of 64.
  • Meta-transfer learning(MTL)
    • Learns scaling and shifting(SS) parameters for feature extractor neurons, enabling fast adaptation to few-shot tasks
    • SS through HT meta-batch training
    • The loss of T^(tr) is used to optimize the base-learner (classifier) by GD without updating the feature extractor (conv layers); this classifier also differs from the one in the previous phase (5 classes instead of 64)
  • Hard task (HT) meta-batch
    • Intentionally pick up failure cases in each task and recompose their data into harder tasks for adverse re-training – "grow up through hardness"
    • Pipeline (see the sketch after this list): base-learner optimized by the loss of T^(tr) -> SS parameters optimized once by the loss of T^(te) -> get the recognition accuracy of T^(te) for the M classes -> the lowest accuracy Acc_m determines the most difficult class-m
    • Choosing the hard class-m: by ranking, not by a fixed threshold.
    • Two methods of hard-task re-sampling using m: given the chosen {m}, we re-sample tasks T_hard by
      • Directly using samples of class-m in current task
      • Indirectly using the label of class-m to sample new samples of that class
  • Algorithms: Algorithm 1 covers the large-scale DNN training, meta-transfer learning, and HT meta-batch re-sampling; the failure classes come from Algorithm 2 (the learning process on a single task)
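
A small sketch of the HT meta-batch idea under stated assumptions: `class_pool` (a hypothetical mapping from class label to its images) and the 5-way/1-shot numbers are placeholders. The hard class is chosen by ranking per-class accuracy on T^(te), and new tasks are re-composed around the chosen classes (the "indirect" variant, sampling fresh images by label).

```python
import random
from collections import defaultdict

def hardest_class(y_true, y_pred):
    """Return the class label with the lowest recognition accuracy on T^(te)
    (selection by ranking, not by a fixed threshold)."""
    correct, total = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return min(total, key=lambda c: correct[c] / total[c])

def resample_hard_task(hard_classes, class_pool, n_way=5, k_shot=1, q_query=15):
    """Re-compose a harder task T_hard: keep the failed classes {m}, fill the
    remaining slots with random classes, then sample new images per class."""
    others = [c for c in class_pool if c not in hard_classes]
    chosen = list(hard_classes) + random.sample(others, n_way - len(hard_classes))
    support, query = [], []
    for c in chosen:
        imgs = random.sample(class_pool[c], k_shot + q_query)
        support += [(img, c) for img in imgs[:k_shot]]
        query += [(img, c) for img in imgs[k_shot:]]
    return support, query

# Usage sketch: collect one hard class per failed task, then build T_hard
# hard = {hardest_class(y_true, y_pred) for (y_true, y_pred) in failures}
# support, query = resample_hard_task(hard, class_pool)
```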

Experiments

  • Datasets and Implementation details
    • miniImageNet: for few-shot learning evaluation.
    • Fewshot-CIFAR100(FC100)
  • Network architecture
    • Feature extractor: 2 options (see the sketch after this list)
      • 4CONV: 4 layers with 3×3 convolutions and 32 filters -> BN -> ReLU -> 2×2 max-pooling.
      • ResNet-12: 4 residual blocks, each block with 3 conv layers using 3×3 kernels; a 2×2 max-pooling layer at the end of each residual block. The number of filters starts at 64 and doubles in every subsequent block
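
A minimal PyTorch sketch of the 4CONV option as listed above; details beyond the listed ones (e.g. padding, input channels) are assumptions. The ResNet-12 option follows the same pattern with residual blocks whose filter counts go 64 -> 128 -> 256 -> 512.

```python
import torch.nn as nn

def conv_block(in_ch, out_ch=32):
    """One 4CONV block: 3x3 conv -> BatchNorm -> ReLU -> 2x2 max-pool."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(2),
    )

# 4CONV feature extractor: four blocks with 32 filters each.
four_conv = nn.Sequential(
    conv_block(3), conv_block(32), conv_block(32), conv_block(32)
)
```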

Conclusion

  • MTL with HT meta-batch achieves top performance in tackling the few-shot learning problem on miniImageNet and FC100

Link: https://arxiv.org/pdf/1812.02391v3.pdf
