Prior Knowledge for Few-shot Learning

Few-shot learning is an important technique that can improve the learning capabilities of machine intelligence and enable practical adaptive applications. Previous researchers apply meta-learning strategies to endow models with the ability to learn from few examples, or leverage transfer learning to alleviate the data-hungry challenge. Moreover, prior knowledge such as knowledge graphs can also be modeled under the few-shot setting. This post gives an overview of recent works on how prior knowledge can address the problem of few-shot learning, and discusses a simple and efficient few-shot learning approach that estimates novel class distributions inductively from the base classes.

Introduction

Humans can adapt to a novel task from only a few observations, because our brains have an excellent capability of learning to learn. In contrast, modern artificial intelligence (AI) systems generally require a large number of annotated samples to adapt. Few-shot learning (FSL) has therefore become an important and widely studied problem. Different from conventional machine learning, FSL aims to learn prior knowledge on base classes with large amounts of labeled data and utilize that knowledge to recognize few-shot classes with scarce labeled data.

Existing studies on FSL roughly fall into two categories, namely metric-based learning and optimization-based methods. The common methodology of metric-based learning algorithms is to classify test samples by matching them to the nearest class prototype. However, training on only a few labeled samples may yield a biased distribution (or biased prototype), especially in the one-shot learning scenario. As shown in Figure 1, the given few labeled samples may be far from their ground-truth centers when the novel classes have large variances. Hence, how to estimate representative prototypes from a few labeled samples is a meaningful question.

Figure 1. The distribution of base and novel class samples in the pretrained feature space.

Review the Past

Prior knowledge is an essential element for alleviating the problem of an unreliable empirical risk minimizer in supervised FSL.

Few-Shot Learning

Existing FSL works can be categorized from three perspectives: Data, Model, and Algorithm. Moreover, combining these three perspectives can yield more robust solutions to FSL, as demonstrated in recent works.

Data

Augmenting the Training Dataset Data augmentation of the original training dataset via hand-crafted rules is often used as pre-processing in FSL methods. For example, in the NLP domain, one can use back-translation, word replacement, cutoff, and adversarial training. In addition, several strategies that combine different augmentation methods have been proposed, such as applying multiple transformations sequentially [7]. However, data augmentation alone has limited capacity to solve the FSL problem; for example, existing approaches tend to be specific to one domain, making them hard to apply to others.
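
As a rough illustration, below is a minimal sketch of two such rule-based text augmentations: word replacement from a toy synonym table, and a token-level cutoff. The synonym table and function names are hypothetical stand-ins; real systems would draw replacements from WordNet or embedding neighborhoods, and the original cutoff technique zeroes spans of the embedding matrix rather than deleting tokens.

```python
import random

# Hypothetical synonym table; a real system would use WordNet or embeddings.
SYNONYMS = {"good": ["great", "fine"], "movie": ["film"], "slow": ["sluggish"]}

def word_replace(tokens, p=0.3, rng=random):
    """Replace each token that has synonyms with probability p."""
    return [rng.choice(SYNONYMS[t]) if t in SYNONYMS and rng.random() < p else t
            for t in tokens]

def cutoff(tokens, span=2, rng=random):
    """Simplified cutoff: drop a random contiguous span of tokens."""
    if len(tokens) <= span:
        return tokens
    start = rng.randrange(len(tokens) - span)
    return tokens[:start] + tokens[start + span:]

sent = "the movie was good but a bit slow".split()
print(word_replace(sent))   # e.g., ['the', 'film', 'was', 'good', ...]
print(cutoff(sent))         # same sentence with a 2-token span removed
```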

Figure 3. A taxonomy of FSL methods.

Utilizing Weakly Labeled or Unlabeled Data How to sufficiently leverage weakly labeled or unlabeled data along with limited labeled data (i.e., semi-supervised learning, SSL) [8] has always been a hot topic in machine learning; approaches can be broadly categorized into self-training and consistency regularization. However, a necessary condition, namely that labeled data and pseudo-labeled unlabeled data come from the same distribution during training, is hard to satisfy in real applications. Thus, a popular idea in recent works [9][10] is to select a subset of training examples from the unlabeled examples for SSL.
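
A minimal sketch of the subset-selection idea, in the spirit of the confidence thresholding used by FixMatch [9]: only unlabeled examples whose predicted class probability clears a threshold receive pseudo labels. The function name and toy predictions are illustrative.

```python
import numpy as np

def select_pseudo_labeled(probs, threshold=0.95):
    """Keep only unlabeled examples whose top predicted probability
    exceeds a confidence threshold; return their indices and pseudo labels."""
    conf = probs.max(axis=1)          # model confidence per example
    pseudo = probs.argmax(axis=1)     # hard pseudo label
    mask = conf >= threshold          # subset admitted into training
    return np.flatnonzero(mask), pseudo[mask]

# Toy predictions for 4 unlabeled examples over 3 classes.
probs = np.array([[0.98, 0.01, 0.01],
                  [0.40, 0.35, 0.25],
                  [0.05, 0.94, 0.01],
                  [0.96, 0.02, 0.02]])
idx, labels = select_pseudo_labeled(probs)
print(idx, labels)   # -> [0 3] [0 0]; example 2 (conf 0.94) misses the cut
```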

Transforming Samples from Similar Datasets This strategy augments training data by aggregating and adapting input-output pairs from similar but larger datasets. The aggregation weight is usually based on some similarity measure between samples.

Model

Multitask Learning In the presence of multiple related tasks, multitask learning learns these tasks simultaneously by exploiting both task-generic and task-specific information. The core issue in applying multitask learning to FSL is the design of the related-task composition, which depends heavily on domain knowledge.

Metric Learning For the FSL problem, researchers have proposed simple but effective algorithms based on metric learning. For example, MatchingNet and ProtoNet learn to classify samples by comparing distances to the representatives of each class.
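
A minimal ProtoNet-style sketch on pre-computed embeddings: prototypes are the class means of the support features, and each query takes the label of its nearest prototype (Euclidean distance here; MatchingNet instead uses attention over cosine similarities).

```python
import numpy as np

def prototypes(support, labels, n_way):
    """Class prototypes: the mean embedding of each class's support samples."""
    return np.stack([support[labels == c].mean(axis=0) for c in range(n_way)])

def classify(query, protos):
    """Assign each query to the class of its nearest prototype (Euclidean)."""
    dists = ((query[:, None, :] - protos[None, :, :]) ** 2).sum(-1)  # (Q, N)
    return dists.argmin(axis=1)

rng = np.random.default_rng(0)
support = rng.normal(size=(10, 64))        # toy 2-way 5-shot embeddings
labels = np.repeat([0, 1], 5)
query = support[:2] + 0.01 * rng.normal(size=(2, 64))  # near class-0 samples
print(classify(query, prototypes(support, labels, n_way=2)))  # typically [0 0]
```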

Generative Model Generative models are often designed to compensate for the insufficient number of available samples through generation. Most methods use Generative Adversarial Networks (GANs) or autoencoders to generate samples or features that augment the training set. For instance, Yoo et al. propose a data augmentation technique that leverages large-scale language models (e.g., GPT-3) to generate realistic text samples from a mixture of real samples.

Learning with External Memory Recently, constructing a key-value memory from the training dataset and making inferences by retrieving from the memory based on query-key similarity has shown promising results. Importantly, the whole process is non-parametric and requires no parameter updates. Such cache models have been adopted for improving language generation in kNN-LMs [13][15]. Moreover, Zhang et al. [14] explore this idea with CLIP under the few-shot setting.
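
A minimal sketch of such non-parametric retrieval: features of training examples are the keys, their labels (or next tokens, in kNN-LMs) the values, and prediction is a distance-weighted vote over the k nearest keys. The names and the Gaussian-kernel weighting are illustrative choices, not the exact formulation of [13] or [14].

```python
import numpy as np

def knn_predict(query, keys, values, n_classes, k=5, temp=1.0):
    """Non-parametric inference from a key-value memory: retrieve the k
    nearest keys and take a distance-weighted vote over their values.
    No parameter is updated at any point."""
    d = ((keys - query) ** 2).sum(axis=1)   # squared L2 distance to all keys
    nn = np.argsort(d)[:k]                  # k nearest memory entries
    w = np.exp(-d[nn] / temp)               # closer keys get larger weights
    scores = np.bincount(values[nn], weights=w, minlength=n_classes)
    return scores.argmax()

rng = np.random.default_rng(0)
keys = rng.normal(size=(100, 32))           # cached feature vectors
values = rng.integers(0, 5, size=100)       # label (or token id) per key
print(knn_predict(keys[7], keys, values, n_classes=5))
# The nearest key is entry 7 itself, so this returns values[7].
```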

Algorithm

Refining Pretrained Model This strategy takes a pre-trained model learned from related tasks as a good initialization and adapts it to a new task. The assumption is that the pre-trained model captures some general structure of the large-scale data, so it can be adapted to a new task with limited labeled data in a few iterations.
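
A minimal PyTorch sketch of this strategy: freeze a (stand-in) pretrained backbone and fit only a new linear head on the few labeled samples. The toy backbone, shapes, and data are placeholders for a real pretrained model and task.

```python
import torch
import torch.nn as nn

# `backbone` stands in for a real pretrained feature extractor.
backbone = nn.Sequential(nn.Linear(64, 128), nn.ReLU())
for p in backbone.parameters():
    p.requires_grad = False            # keep the general structure fixed

head = nn.Linear(128, 5)               # new head for a 5-way novel task
opt = torch.optim.Adam(head.parameters(), lr=1e-3)

x = torch.randn(25, 64)                # toy 5-way 5-shot support features
y = torch.arange(5).repeat(5)          # labels 0..4, five samples each
for step in range(100):                # a few iterations usually suffice
    loss = nn.functional.cross_entropy(head(backbone(x)), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```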

Learning Optimizer One of the most general families of meta-learning algorithms is optimization-based. Finn et al. [11] and Li et al. [12] proposed to learn how to optimize the gradient descent procedure so that the learner can obtain a good initialization, update direction, and learning rate.
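
A minimal sketch of the optimization-based idea, using the first-order approximation of MAML (FOMAML) on toy linear-regression tasks to avoid second-order gradients; full MAML [11] backpropagates through the inner update, and Meta-SGD [12] additionally learns per-parameter inner learning rates. All names and hyperparameters here are illustrative.

```python
import numpy as np

def grad(w, X, y):
    """Gradient of mean squared error for the linear model y ~ X @ w."""
    return 2 * X.T @ (X @ w - y) / len(y)

def fomaml_step(w, tasks, inner_lr=0.01, meta_lr=0.1):
    """One first-order MAML meta-update: adapt on each task's support set,
    then move the shared initialization along the query-set gradient
    evaluated at the adapted weights (the first-order approximation)."""
    meta_g = np.zeros_like(w)
    for (Xs, ys), (Xq, yq) in tasks:
        w_task = w - inner_lr * grad(w, Xs, ys)   # inner adaptation step
        meta_g += grad(w_task, Xq, yq)            # outer (query-set) gradient
    return w - meta_lr * meta_g / len(tasks)

rng = np.random.default_rng(0)

def make_task():
    """A toy regression task: support/query splits of a random linear map."""
    w_true = rng.normal(size=3)
    X = rng.normal(size=(10, 3))
    y = X @ w_true
    return (X[:5], y[:5]), (X[5:], y[5:])

w = np.zeros(3)                       # the meta-learned initialization
for _ in range(200):
    w = fomaml_step(w, [make_task() for _ in range(4)])
```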

Recently, Yang et al. [1] calibrated the distributions of few-sample classes by transferring statistics from classes with sufficient examples. An adequate number of examples can then be sampled from the calibrated distribution as a data augmentation technique. Their approach achieves state-of-the-art accuracy on three datasets (a 5% improvement on miniImageNet).

Distribution Calibration

Motivation

As pointed out above, a few samples may yield a biased distribution estimate, which can damage the generalization ability of the model. The authors observed that the feature distributions of similar classes usually have similar statistics (e.g., mean and variance), as shown in Table 1 (a short sketch of such a statistics comparison follows the table). Meanwhile, these statistics can be estimated more accurately when a class has adequate samples. Based on these observations, the authors proposed to transfer the statistics of many-shot classes to estimate the distributions of the few-shot classes, according to the semantic similarity between classes.

Class (vs. Arctic fox)    Mean sim    Var sim
white wolf                97%         97%
malamute                  85%         78%
lion                      81%         70%
meerkat                   78%         70%
Table 1. The class statistics similarity between Arctic fox and different classes.
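
A minimal sketch of how such statistics could be compared, using cosine similarity between the per-dimension means and variances of two classes' feature matrices; the feature data here is synthetic and the exact metric behind Table 1 may differ.

```python
import numpy as np

def stat_similarity(feats_a, feats_b):
    """Cosine similarity between two classes' feature means and variances
    (a stand-in for the statistics comparison behind Table 1)."""
    cos = lambda u, v: u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return (cos(feats_a.mean(0), feats_b.mean(0)),
            cos(feats_a.var(0), feats_b.var(0)))

rng = np.random.default_rng(0)
fox = rng.normal(1.0, 0.5, (200, 64))    # synthetic class features
wolf = rng.normal(1.1, 0.5, (200, 64))
print(stat_similarity(fox, wolf))        # similar classes -> values near 1
```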

Method

This work follows the typical few-shot classification setting. Tasks are formed as N-way K-shot episodes, where each task consists of N few-shot classes with K labeled samples per class (the support set) and some unlabeled samples for testing (the query set). A minimal episode-sampling sketch is given below; the full training procedure for an N-way-K-shot task is shown in Algorithm 1.
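
This sketch builds the support and query index sets for a single N-way K-shot task from a flat label array; the function name and toy labels are illustrative.

```python
import numpy as np

def sample_episode(labels, n_way=5, k_shot=1, q_query=15, rng=None):
    """Sample one N-way K-shot task: K support and Q query indices
    for each of N randomly chosen classes."""
    rng = rng or np.random.default_rng()
    classes = rng.choice(np.unique(labels), size=n_way, replace=False)
    support, query = [], []
    for c in classes:
        idx = rng.permutation(np.flatnonzero(labels == c))
        support.append(idx[:k_shot])
        query.append(idx[k_shot:k_shot + q_query])
    return np.concatenate(support), np.concatenate(query)

labels = np.repeat(np.arange(20), 30)   # toy dataset: 20 classes x 30 samples
sup, qry = sample_episode(labels)
print(sup.shape, qry.shape)             # (5,) support and (75,) query indices
```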

Algorithm 1. Training procedure for an N-way-K-shot task.

First, the authors assume the feature distribution of each base class is Gaussian. The mean of the feature vectors from a base class $i$ is computed per dimension, and a covariance matrix is estimated from the same vectors. Second, to make the feature distribution more Gaussian-like, a key step is to transform the features of the support and query sets in the target task using Tukey's Ladder of Powers transformation, \(\tilde{x} = x^{\lambda}\) (or \(\log x\) when \(\lambda = 0\)).
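
A minimal sketch of these two steps, assuming pre-extracted, non-negative (post-ReLU) feature vectors; the function names are illustrative.

```python
import numpy as np

def tukey(x, lam=0.5):
    """Tukey's Ladder of Powers: x**lam, or log(x) when lam == 0.
    Features are assumed non-negative (e.g., post-ReLU)."""
    return np.power(x, lam) if lam != 0 else np.log(x)

def base_stats(features, labels):
    """Mean and full covariance matrix of each base class's features;
    one (mu_i, Sigma_i) pair is stored per base class."""
    stats = {}
    for c in np.unique(labels):
        f = features[labels == c]
        stats[c] = (f.mean(axis=0), np.cov(f, rowvar=False))
    return stats
```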

During the distribution calibration step, statistics are transferred based on the Euclidean distance between the transformed feature of a novel-class support sample and the means of the base-class features. Specifically, the top $k$ nearest base classes $S_N$ are selected to construct the calibrated distribution: \(\boldsymbol{\mu}'=\frac{\sum_{i \in S_{N}}\boldsymbol{\mu}_i+\tilde{\boldsymbol{x}}}{k+1}, \quad \boldsymbol{\Sigma}'=\frac{\sum_{i \in S_{N}}\boldsymbol{\Sigma}_i}{k}+\alpha,\) where \(\tilde{\boldsymbol{x}}\) is the transformed support feature and \(\alpha\) is a hyperparameter that controls the dispersion of the calibrated distribution.
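
A direct translation of the formula into code, assuming `stats` maps each base class to the (mean, covariance) pair computed earlier; the default values of `k` and `alpha` are illustrative hyperparameters.

```python
import numpy as np

def calibrate(x_tilde, stats, k=2, alpha=0.21):
    """Calibrate one support sample's distribution: find the k base classes
    whose means are closest to x_tilde, then combine statistics as in the
    formula above. alpha is added elementwise to spread the covariance."""
    keys = list(stats.keys())
    mus = np.stack([stats[c][0] for c in keys])
    top = np.argsort(((mus - x_tilde) ** 2).sum(axis=1))[:k]
    mu = (mus[top].sum(axis=0) + x_tilde) / (k + 1)
    cov = sum(stats[keys[i]][1] for i in top) / k + alpha
    return mu, cov
```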

After obtaining the calibrated distribution, sufficient labeled feature vectors can be generated by sampling from the calibrated Gaussian distributions and used to train a classifier.
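
A minimal sketch of this final step, using scikit-learn's logistic regression as the simple linear classifier; `calibrated` maps each novel class to its calibrated (mean, covariance), and the number of generated features per class is a hyperparameter.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def augment_and_train(support_x, support_y, calibrated, n_gen=750, rng=None):
    """Sample features from each class's calibrated Gaussian, merge them
    with the real (transformed) support features, and fit a linear model."""
    rng = rng or np.random.default_rng(0)
    xs, ys = [support_x], [support_y]
    for c, (mu, cov) in calibrated.items():
        xs.append(rng.multivariate_normal(mu, cov, size=n_gen))
        ys.append(np.full(n_gen, c))
    X, y = np.concatenate(xs), np.concatenate(ys)
    return LogisticRegression(max_iter=1000).fit(X, y)
```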

Analysis

Strengths As shown in Table 2, a simple linear classifier equipped with the distribution calibration method performs better than state-of-the-art few-shot classification methods and achieves the best results on the 1-shot and 5-shot settings of miniImageNet, tieredImageNet, and CUB. Specifically, DC surpasses the state-of-the-art method by 10% in the 5-way 1-shot setting, which shows that calibrating the distribution handles extremely low-shot classification tasks better by modeling the association between base classes and novel classes.

Table 2. Performance on miniImageNet and CUB.

Compared with generative models that synthesize extra samples or features for training, this distribution calibration strategy is simple and does not need extra learnable parameters. Note that in Figure 2, the features sampled from the calibrated distribution overlap the areas of the query set, which suggests that training with these generated features as data augmentation can improve generalization ability.

Figure 2. t-SNE visualization of distribution estimation.

Limitations The aforementioned approach rests on some limiting assumptions. First, it relies on a reasonable distributional assumption (Gaussian in this work), which may not generalize to other tasks. Second, the method implicitly assumes that each novel class is associated with certain base classes (the top-k base classes), and it does not consider the strength of the similarity between base and novel classes when estimating novel-class statistics. Recent work [2] also argues that this method implicitly assumes the base classes are semantically independent of each other when constructing covariance estimates.

Bias Correction

Bias correction is an idea worth pursuing to solve the above-mentioned problems. Distribution calibration based on many-shot samples is one representative way to rectify the distribution. The main methodologies of bias correction for few-shot learning can be summarized as reconstruction-based methods, methods utilizing primitive knowledge, and methods utilizing extra data.

  • Reconstruction-based: A class of methods [3][6] that constructs pairs of noisy (biased) prototypes and target (representative) prototypes and trains a regression model to restore the prototype. Despite requiring the design of a more complex model, this approach does not depend on extra data, which may be unavailable in some scenarios.
  • Utilizing primitive knowledge: Recent works have demonstrated that rectifying prototypes with primitive knowledge can achieve prominent improvements for few-shot learning. Zhang et al. [4] design a framework that introduces WordNet (i.e., attribute annotations) as extra knowledge, extracts representative attribute features as priors, and completes prototypes with these priors.
  • Utilizing extra data: Since it draws on many-shot data, the aforementioned distribution calibration approach can be viewed as utilizing extra data. Another line of approaches leverages unlabeled samples, which overlaps with semi-supervised learning.

Look Forward To the Future

Future = Large Model ? The recent GPT-3 model achieves remarkable few-shot performance solely by leveraging a natural-language prompt and a few task demonstrations as input context. Gao et al. proposed LM-BFF, a suite of simple and complementary techniques for fine-tuning language models on a small number of annotated examples. However, a common problem is that these works, when applied to FSL, remain limited to specific tasks (e.g., simple text classification). Large models pretrained on large data may have strong perceptive ability, but they perform poorly on tasks involving inductive reasoning.

Future = Large Model + Knowledge ? From the perspective of bias correction, it is still difficult to design an efficient model that avoids the risk of overfitting the base tasks without any extra constraints (or inductive biases). Combining extra data (or knowledge) with model design may be a promising direction for breaking the dilemma of few-shot learning. More and more researchers have begun to explore incorporating prior knowledge into large models, which may boost the ability of fast adaptation from limited experience.

References

[1] Free Lunch for Few-shot Learning: Distribution Calibration. (ICLR 2021)
[2] Generalized Distribution Calibration for Few-Shot Learning. (Preprint)
[3] One-Shot Image Classification by Learning to Restore Prototypes. (AAAI 2020)
[4] Prototype Completion with Primitive Knowledge for Few-Shot Learning. (CVPR 2021)
[5] Prototype Rectification for Few-Shot Learning. (ECCV 2020)
[6] Learning to Learn: Model Regression Networks for Easy Small Sample Learning. (ECCV 2016)
[7] CoDA: Contrast-enhanced and Diversity-promoting Data Augmentation for Natural Language Understanding. (ICLR 2021)
[8] MixMatch: A Holistic Approach to Semi-Supervised Learning. (NeurIPS 2019)
[9] FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence. (NeurIPS 2020)
[10] Dash: Semi-Supervised Learning with Dynamic Thresholding. (NeurIPS 2021)
[11] Model-agnostic meta-learning for fast adaptation of deep networks. (ICML 2017)
[12] Meta-sgd: Learning to learn quickly for few shot learning. (Preprint)
[13] Generalization through memorization: Nearest neighbor language models. (ICLR 2020)
[14] Tip-Adapter: Training-free CLIP-Adapter for Better Vision-Language Modeling. (Preprint)
[15] KNN-BERT: Fine-Tuning Pre-Trained Models with KNN Classifier (Preprint)