Qianru Sun, Yaoyao Liu, Tat-Seng Chua, Bernt Schiele [NUS, Max Planck] [CVPR 2019]
Abstract
- Meta-learning – proposed to address the challenges of the few-shot learning setting
- Key idea – leverage a large number of similar few-shot tasks to learn how to adapt a base-learner to a new task with few labels.
- DNNs overfit with few samples, so meta-learning typically uses shallow neural networks (SNNs), which limits its effectiveness
- Contribution: Meta-Transfer Learning (MTL) – learns to adapt a DNN for few-shot learning.
- Meta – refers to training on multiple tasks
- Transfer – achieved by learning scaling and shifting functions of DNN weights for each task
- hard-task (HT) meta-batch: an effective learning curriculum for MTL.
- Benchmark: miniImageNet and Fewshot-CIFAR100 (5-class 1-shot and 5-class 5-shot)
Introduction
- Few-shot learning – learn new concepts from few labeled examples; e.g. on CIFAR-100, the state of the art achieves only 40.1% accuracy for 1-shot learning
- Few-shot learning methods fall into 2 categories:
- Data augmentation: a data generator (e.g. conditioned on Gaussian noise), but such methods underperform in the 1-shot setting.
- Task-based meta-learning: meta-learning aims to accumulate learning experience from multiple tasks, while base-learning focuses on modeling the data distribution of a single task.
- Model-Agnostic Meta-Learning (MAML) – learns to search for the optimal initialization state to fast adapt a base-learner to a new task. Limitations: it requires a large number of similar tasks -> costly, and the base-learner is a shallow NN to avoid overfitting, so it cannot leverage DNNs
- MTL – a novel learning method that converges faster with a lower probability of overfitting.
- Transfer – weight transfer via 2 lightweight neuron operations: scaling and shifting, αX + β.
- 2nd contribution: an effective meta-training curriculum. Curriculum learning and hard negative mining -> faster convergence and stronger performance. Inspired by this, they design the hard task (HT) meta-batch strategy, which online re-samples harder tasks according to past failure tasks with the lowest validation accuracy.
Related Work
- Few-shot learning
- Metric learning method: learn a similarity space in which learning is efficient
- Memory network methods: learn to store experience when learning seen tasks and generalize it to unseen tasks.
- Gradient-descent-based methods: have a specific meta-learner that learns to adapt a base-learner across different tasks (e.g. MAML) – this paper belongs to this category
- Transfer learning
- Fine-tuning
- Taking pre-trained networks as a backbone and adding high-level functions (e.g. object detection, recognition, and image segmentation)
- Curriculum learning & Hard sample mining
- Curriculum learning: instead of randomly sampling observations, organize them in a meaningful order -> faster convergence, effective learning, better generalization
- Hard sample mining: in object detection, it treats image proposals overlapped with ground-truth as hard negative samples. Training on more confusing data enables the model to achieve higher robustness & better performance
Preliminary
- Meta-learning: 2 phases on a classification task; each task T (episode) is sampled from a distribution p(T)
- Meta-train: aims to learn from a number of episodes {T}
- Meta-test
- Meta-training phase: learn from multiple episodes, with a 2-stage optimization in each episode (see the episode-loop sketch below)
- Stage 1, base-learning: cross-entropy loss to optimize the parameters of the base-learner
- Stage 2, feed-forward test on the episode's test data points: the test loss is used to optimize the parameters of the meta-learner.
- Meta-test phase: tests the fast adaptation to unseen tasks.
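A minimal sketch of the two-stage optimization inside one meta-training episode, assuming PyTorch, a linear base-learner on pre-extracted features, synthetic episode data, and a first-order approximation for the meta-update; these are illustrative assumptions, not the paper's exact setup.

```python
# Two-stage episode optimization: stage 1 adapts the base-learner on T(tr),
# stage 2 updates the meta-parameters from the loss on T(te).
import torch
import torch.nn.functional as F

feat_dim, n_way, k_shot, k_query = 32, 5, 1, 15   # made-up sizes for illustration
base_lr, meta_lr, inner_steps = 0.01, 0.001, 20

# The meta-learner owns the initialization of the base-learner (classifier).
meta_w = torch.zeros(n_way, feat_dim, requires_grad=True)
meta_b = torch.zeros(n_way, requires_grad=True)
meta_opt = torch.optim.Adam([meta_w, meta_b], lr=meta_lr)

def sample_episode():
    """Synthetic stand-in for sampling T ~ p(T): returns (x_tr, y_tr, x_te, y_te)."""
    x_tr = torch.randn(n_way * k_shot, feat_dim)
    y_tr = torch.arange(n_way).repeat_interleave(k_shot)
    x_te = torch.randn(n_way * k_query, feat_dim)
    y_te = torch.arange(n_way).repeat_interleave(k_query)
    return x_tr, y_tr, x_te, y_te

for episode in range(100):
    x_tr, y_tr, x_te, y_te = sample_episode()

    # Stage 1 (base-learning): start from the meta-learned init and take a few
    # gradient-descent steps on the episode's training split with cross-entropy.
    w = meta_w.detach().clone().requires_grad_(True)
    b = meta_b.detach().clone().requires_grad_(True)
    for _ in range(inner_steps):
        loss_tr = F.cross_entropy(x_tr @ w.t() + b, y_tr)
        g_w, g_b = torch.autograd.grad(loss_tr, (w, b))
        w = (w - base_lr * g_w).detach().requires_grad_(True)
        b = (b - base_lr * g_b).detach().requires_grad_(True)

    # Stage 2 (meta-update): evaluate the adapted base-learner on the episode's
    # test split and update the meta-parameters with the test-loss gradient
    # (first-order shortcut; the paper back-propagates the test loss instead).
    loss_te = F.cross_entropy(x_te @ w.t() + b, y_te)
    g_w, g_b = torch.autograd.grad(loss_te, (w, b))
    meta_opt.zero_grad()
    meta_w.grad = g_w
    meta_b.grad = g_b
    meta_opt.step()
```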
Methodology – 3 phases
- DNN training on large-scale data
- E.g. on miniImageNet (64 classes, 600 samples per class); the low-level layers are then fixed as the feature extractor.
- First, randomly initialize a feature extractor (the conv layers of a ResNet) and a classifier (the last FC layer of the ResNet), then optimize both by gradient descent.
- The feature extractor is then frozen, and the learned classifier is discarded because few-shot tasks are 5-class instead of 64-class (see the pre-training sketch below).
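A minimal sketch of phase 1 (DNN pre-training on large-scale data), assuming PyTorch and a placeholder `train_loader` over the 64 meta-train classes; the architecture stand-in and hyper-parameters are illustrative, not the authors' code.

```python
# Phase 1: jointly train the feature extractor and a 64-way classifier, then
# freeze the extractor and discard the classifier.
import torch
import torch.nn as nn

feature_extractor = nn.Sequential(            # stand-in for the ResNet conv layers
    nn.Conv2d(3, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)
classifier_64way = nn.Linear(64, 64)          # last FC layer for the 64 pre-training classes
opt = torch.optim.SGD(
    list(feature_extractor.parameters()) + list(classifier_64way.parameters()),
    lr=0.1, momentum=0.9,
)
criterion = nn.CrossEntropyLoss()

def pretrain(train_loader, epochs=1):
    for _ in range(epochs):
        for images, labels in train_loader:   # 64-class, many-shot batches
            logits = classifier_64way(feature_extractor(images))
            loss = criterion(logits, labels)
            opt.zero_grad()
            loss.backward()
            opt.step()

def freeze_extractor():
    # After pre-training: freeze the extractor (used as-is from now on);
    # the 64-way classifier is simply not reused, since few-shot tasks are 5-way.
    for p in feature_extractor.parameters():
        p.requires_grad = False
    feature_extractor.eval()
```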
- Meta-transfer learning (MTL)
- Learns scaling and shifting (SS) parameters for the feature extractor neurons, enabling fast adaptation to few-shot tasks (a sketch of the SS operation follows this block)
- SS parameters are trained through HT meta-batch training
- The loss of T(tr) is used to optimize the base-learner (classifier) by GD, without updating the frozen conv layers; the classifier also differs from the previous phase (5-class instead of 64-class)
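A minimal sketch of the SS operation on one frozen conv layer, assuming PyTorch; the module and variable names (`SSConv2d`, scale/shift standing in for the SS parameters) are illustrative, not from the released code.

```python
# Scaling-and-Shifting: frozen weights W are scaled channel-wise and the bias is
# shifted; only the scale/shift parameters and the small 5-way classifier train.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SSConv2d(nn.Module):
    def __init__(self, frozen_conv: nn.Conv2d):
        super().__init__()
        self.conv = frozen_conv
        for p in self.conv.parameters():          # keep the pre-trained weights frozen
            p.requires_grad = False
        out_ch = frozen_conv.out_channels
        self.scale = nn.Parameter(torch.ones(out_ch, 1, 1, 1))   # init. to ones
        self.shift = nn.Parameter(torch.zeros(out_ch))           # init. to zeros

    def forward(self, x):
        w = self.conv.weight * self.scale          # scaled frozen weights
        b = (self.conv.bias if self.conv.bias is not None
             else torch.zeros_like(self.shift)) + self.shift     # shifted bias
        return F.conv2d(x, w, b, stride=self.conv.stride,
                        padding=self.conv.padding, dilation=self.conv.dilation,
                        groups=self.conv.groups)

# Per task T: the 5-way base-learner (classifier) is updated by GD on the loss of
# T(tr); the SS parameters are updated by the loss of T(te).
ss_layer = SSConv2d(nn.Conv2d(3, 64, 3, padding=1))
base_learner = nn.Linear(64, 5)                    # re-initialized per task, 5-way
meta_opt = torch.optim.Adam([ss_layer.scale, ss_layer.shift], lr=1e-3)
base_opt = torch.optim.SGD(base_learner.parameters(), lr=1e-2)
```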
- Hard task (HT) meta-batch
- Intentionally pick up failure cases in each task and recompose their data into harder tasks for adverse re-training – "grow up through hardness"
- Pipeline: base-learner optimized by the loss of T(tr) -> SS parameters optimized by the loss of T(te) once -> get the recognition accuracy of T(te) for the M classes -> choose the lowest accuracy Acc_m to determine the most difficult class-m
- Choosing hard class-m: by ranking, not by a threshold.
- Two methods of re-sampling hard tasks using m: given the chosen {m}, re-sample tasks T_hard by
- Directly using samples of class-m in current task
- Indirectly using the label of class-m to sample new samples of that class
- Algorithms: Algorithm 1 covers the training of the large-scale DNN, meta-transfer learning, and HT meta-batch re-sampling; the failure classes are returned by Algorithm 2 (the learning process on a single task). A sketch of the hard-class selection and re-sampling follows.
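A minimal sketch of the HT meta-batch idea (lowest-accuracy class selection plus hard-task re-sampling), assuming NumPy; the function names, the shape of `task_results`, and the number of hard tasks are assumptions for illustration, not the authors' implementation.

```python
# Select the failure class m of each finished task, then re-sample harder tasks
# anchored on the collected {m}.
import numpy as np

def hardest_class(pred_labels, true_labels):
    """Return the class m with the lowest recognition accuracy Acc_m on T(te)."""
    pred_labels, true_labels = np.asarray(pred_labels), np.asarray(true_labels)
    classes = np.unique(true_labels)
    accs = [(pred_labels[true_labels == c] == c).mean() for c in classes]
    return classes[int(np.argmin(accs))]           # ranking, not a fixed threshold

def ht_meta_batch(task_results, all_classes, n_way=5, n_hard_tasks=2, rng=None):
    """task_results: list of (pred_labels, true_labels) pairs, one per finished task."""
    rng = rng or np.random.default_rng()
    failure_classes = [hardest_class(p, t) for p, t in task_results]
    hard_tasks = []
    for _ in range(n_hard_tasks):
        m = rng.choice(failure_classes)            # anchor the new task on a failure class
        others = rng.choice([c for c in all_classes if c != m],
                            size=n_way - 1, replace=False)
        # Option (1) reuses the current samples of class m; option (2) draws new
        # samples with label m — both only change how samples of m are filled in.
        hard_tasks.append([m, *others])
    return hard_tasks
```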
Experiments
- Datasets and Implementation details
- miniImageNet: for few-shot learning evaluation (see the episode-sampling sketch after this list).
- Fewshot-CIFAR100 (FC100)
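A minimal sketch of sampling a 5-class 1-shot (or 5-shot) evaluation episode, assuming the data pool is a dict mapping class id to a list of samples; this layout and the name `sample_episode` are assumptions, not the released data loaders.

```python
# Draw an N-way K-shot episode: K support samples and a query split per class.
import random

def sample_episode(pool, n_way=5, k_shot=1, k_query=15, rng=random):
    classes = rng.sample(sorted(pool), n_way)
    support, query = [], []
    for label, c in enumerate(classes):
        items = rng.sample(pool[c], k_shot + k_query)
        support += [(x, label) for x in items[:k_shot]]     # T(tr): K shots per class
        query   += [(x, label) for x in items[k_shot:]]     # T(te): query samples
    return support, query
```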
- Network architecture
- Feature extractor: 2 options
- 4CONV: 4 layers with 3×3 convolutions and 32 filters -> BN -> ReLU -> 2×2 max-pooling.
- ResNet-12: 4 residual blocks, each with 3 conv layers using 3×3 kernels; a 2×2 max-pooling layer at the end of each block; the number of filters starts at 64 and doubles in each subsequent block (a minimal sketch of the 4CONV option follows this list)
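A minimal sketch of the 4CONV feature extractor option in PyTorch; the padding and the flatten head are assumptions not restated in these notes.

```python
# 4 blocks of 3x3 conv with 32 filters, each followed by BatchNorm, ReLU,
# and 2x2 max-pooling.
import torch.nn as nn

def conv_block(in_ch, out_ch=32):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(),
        nn.MaxPool2d(2),
    )

four_conv = nn.Sequential(
    conv_block(3), conv_block(32), conv_block(32), conv_block(32),
    nn.Flatten(),                 # flattened features feed the few-shot classifier
)
```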
Conclusion
- Achieves top performance in tackling the few-shot learning problem