Representation Change in Model-Agnostic Meta-Learning
01 Dec 2021 | meta-learning, maml, representation change, representation reuse

Last year, an exciting adaptation of one of the most popular optimization-based meta-learning approaches, model-agnostic meta-learning (MAML) [Finn et al., 2017], was proposed in
- Jaehoon Oh, Hyungjun Yoo, ChangHwan Kim, Se-Young Yun (ICLR, 2021) BOIL: Towards Representation Change for Few-shot Learning
The authors adapt MAML by freezing the last layer, forcing body-only updates in the inner loop (BOIL). Interestingly, this is complementary to ANIL (almost no inner loop), proposed in
- Aniruddh Raghu, Maithra Raghu, Samy Bengio, Oriol Vinyals (ICLR, 2020) Rapid Learning or Feature Reuse? Towards Understanding the Effectiveness of MAML
Both papers attempt to understand the success of MAML and improve it. Oh et al. [2021] compare BOIL, ANIL, and MAML and show that both modifications improve the performance of MAML, but BOIL outperforms ANIL, especially when the task distribution varies between training and testing.
MAML
Before studying BOIL and ANIL, it is worth recalling how MAML works, as it forms the basis of both algorithms. MAML learns an initialization using second-order methods across tasks from the same distribution. The optimization is done in two nested loops (bi-level optimization), with meta-optimization happening in the outer loop. The entire optimization objective can be expressed as:
\[\begin{equation}\label{equ:outer} \theta^* := \underset{\theta \in \Theta}{\mathrm{argmin}} \frac{1}{M} \sum_{i=1}^M \mathcal{L}(\mathrm{in}(\theta, \mathcal{D}_i^{tr}), \mathcal{D}_i^{test}), \end{equation}\]where $M$ denotes the number of tasks in a batch, and $\mathcal{D}_i^{tr}$ and $\mathcal{D}_i^{test}$ are the training and test set of task $i$. The function $\mathcal{L}$ is the task loss, and the function $\mathrm{in}(\theta, \mathcal{D}_i^{tr})$ represents the inner loop. For every task $i$ in a batch, the neural network is initialized with $\theta$. In the inner loop, this value is optimized for one or a few steps of gradient descent on the training set $\mathcal{D}_i^{tr}$ to obtain fine-tuned task parameters $\phi_i$. Taking only one training step in the inner loop, for example, the task parameters correspond to
\[\begin{equation} \phi_i \equiv \mathrm{in}(\theta, \mathcal{D}_i^{tr}) = \theta - \alpha \nabla_{\theta} \mathcal{L}(\theta, \mathcal{D}_i^{tr}). \end{equation}\]The meta-parameters $\theta$ are then updated with respect to the average loss of each task's fine-tuned parameters $\phi_i$ on the test set $\mathcal{D}_i^{test}$. Thus, MAML optimizes with respect to the loss after fine-tuning, which gives it far superior performance to simple pre-training, as outlined in the original publication. Many adaptations of MAML improve learning speed or performance and solve a variety of new tasks and task distributions. A deeper introduction and an interactive comparison of some variants can be found here.
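To make the two nested loops concrete, here is a minimal PyTorch sketch of one meta-training step. It is a simplified illustration under our own assumptions, not the reference implementation: the task batch `tasks`, the learning rate `alpha`, the meta-optimizer `meta_opt`, and the use of `torch.func.functional_call` are all illustrative choices.

```python
import torch
import torch.nn.functional as F
from torch.func import functional_call

def inner_loop(model, theta, x_tr, y_tr, alpha):
    """One inner-loop step: fine-tune theta on the task's training set."""
    loss = F.cross_entropy(functional_call(model, theta, (x_tr,)), y_tr)
    # create_graph=True keeps the graph, so the outer loop can differentiate
    # through this update (the second-order part of MAML)
    grads = torch.autograd.grad(loss, list(theta.values()), create_graph=True)
    return {name: p - alpha * g for (name, p), g in zip(theta.items(), grads)}

def maml_step(model, meta_opt, tasks, alpha=0.01):
    """One outer-loop (meta) step over a batch of tasks."""
    theta = dict(model.named_parameters())
    meta_loss = 0.0
    for x_tr, y_tr, x_te, y_te in tasks:  # each task: training and test set
        phi = inner_loop(model, theta, x_tr, y_tr, alpha)  # task parameters
        meta_loss = meta_loss + F.cross_entropy(
            functional_call(model, phi, (x_te,)), y_te)
    (meta_loss / len(tasks)).backward()   # backprop through the inner loop
    meta_opt.step()
    meta_opt.zero_grad()
```

Note that the meta-gradient flows through the inner-loop update itself, which is exactly what makes MAML a second-order method.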
Freezing layers
In the standard version of MAML and most of its variants, all parameters of the meta-optimized model are updated in the inner loop. However, Raghu et al. [2020] discovered that during fine-tuning in the inner loop, the representations of the body (the convolutional layers before the fully-connected head) of the network hardly change (further details can be found two sections below). Therefore, the authors propose to skip updating the network body altogether, saving a significant amount of time, as the expensive second-order updates are then only required for the network head. Additionally, they observe regularization effects that further improve the model's performance. The authors' empirical results confirm a slight increase in performance while achieving a speed-up by a factor of $1.7$ during training and $4.1$ during inference. They conclude that MAML reuses features rather than rapidly learning them. Here, feature reuse is attributed to layers whose performance does not rely on a change of representation in the inner loop (which, according to the authors, goes along with small changes in the layers' weights). Rapid learning can therefore be found only in the head, where a lot of change happens during fine-tuning.
In addition, the authors of ANIL propose an algorithm where the head is dropped entirely, and only distances between representations are used to classify samples, resulting in the NIL (no inner loop) algorithm, which shows at least comparable performance.
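NIL's head-free classification could be sketched as follows, continuing the imports from the MAML sketch above. How exactly similarities are aggregated is a detail we gloss over here; this nearest-support-neighbour variant is our own simplified reading, and `body` stands for the meta-trained feature extractor.

```python
def nil_predict(body, x_support, y_support, x_query):
    """Head-free (NIL) prediction: label a query sample with the class of
    the support sample whose body representation is most similar."""
    f_s = F.normalize(body(x_support).flatten(1), dim=1)  # unit-norm features
    f_q = F.normalize(body(x_query).flatten(1), dim=1)
    sims = f_q @ f_s.T                     # query-by-support cosine similarity
    return y_support[sims.argmax(dim=1)]   # nearest support neighbour's label
```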
Instead of simplifying the fine-tuning of MAML, Oh et al. [2021] come up with the complementary idea: if MAML does not change the representations in the first part of the network, why not freeze the last layer instead? Then, the neural network is forced to update the body to fulfill the few-shot learning task. Both variants can be expressed with the same masked inner loop, as sketched below.
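Conceptually, ANIL and BOIL are small changes to the inner loop from the MAML sketch above: compute the update but apply it only to a subset of the parameters. The name prefix `head` is an assumption about how the model registers its classifier, and the sketch favours brevity over efficiency.

```python
def masked_inner_loop(model, theta, x_tr, y_tr, alpha, update):
    """Inner-loop step that only updates the parameters selected by `update`."""
    loss = F.cross_entropy(functional_call(model, theta, (x_tr,)), y_tr)
    grads = torch.autograd.grad(loss, list(theta.values()), create_graph=True)
    return {name: p - alpha * g if update(name) else p
            for (name, p), g in zip(theta.items(), grads)}

anil = lambda name: name.startswith("head")      # ANIL: update the head only
boil = lambda name: not name.startswith("head")  # BOIL: update the body only
```

In a real implementation, ANIL's speed-up comes from not computing the frozen gradients at all; the sketch computes them anyway for clarity.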
The differences in learning can be seen in the figure on the right. The points depict the representations after the feature extractor (for image tasks, usually the convolutional part), and the two lines depict the (linear) classifier. MAML hardly updates the representation, but the classifier clearly changes. ANIL only changes the linear classifier, whereas BOIL changes only the representations.
For clarity, Oh et al. [2021] rename feature reuse as representation reuse and rapid learning as representation change. In the following, we will also use these terms. In addition, they claim that representation change is necessary to solve cross-domain tasks and that BOIL is closer to the ultimate goal of meta-learning, which is domain-agnostic adaptation. The authors of BOIL verify their claims via the experiments discussed in the next section.
Experiments
Oh et al. [2021] compare the performance of both ANIL and BOIL with MAML. For the results reported in the figure below, the 4-conv network with 64 channels from Vinyals et al. [2016] is applied to multiple datasets. For most of the datasets, BOIL outperforms MAML and ANIL significantly. This holds for few-shot learning tasks based on both coarse-grained and fine-grained datasets. In addition, experiments are performed where the meta-train and meta-test tasks do not come from the same distribution, mixing general datasets with specific ones. In this classical transfer-learning setting, where a model trained on general data is applied to specific data, BOIL outperforms the other methods significantly.
All in all, we observe that BOIL outperforms ANIL and MAML in most cases, and ANIL mostly outperforms MAML. But to answer whether representation change or representation reuse is dominant, we believe that focusing only on predictive performance is insufficient.
Representation Similarity Analysis
Both ANIL and BOIL use representation similarity analysis to justify their hypotheses, specifically to claim representation change or representation reuse for different layers. We would like to extend this discussion and reason about the similarity analysis results. In both ANIL and BOIL, representation similarity analysis was applied to query sets with 5 classes and 15 images per class from the miniImageNet dataset.
Centered Kernel Alignment
Applying centered kernel alignment (CKA) [Kornblith et al., 2019] to the representations of MAML before and after the inner-loop updates shows that the similarity in the last layer of the model changes a lot during fine-tuning, whereas all other layers barely change. As the assignment of labels in few-shot learning is completely random, this is not surprising. Looking at how tasks are generated, we know that there could be two tasks that have exactly the same data but a different order of the labels.
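For reference, the linear variant of CKA can be computed in a few lines. This sketch follows the formula from Kornblith et al. [2019]; the names `features_before` and `features_after` are our placeholders for a layer's query-set activations before and after the inner loop.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two representation matrices of shape
    (n_samples, n_features), following Kornblith et al. [2019]."""
    X = X - X.mean(axis=0)                           # center each feature
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, ord="fro") ** 2   # ||Y^T X||_F^2
    return hsic / (np.linalg.norm(X.T @ X, ord="fro")
                   * np.linalg.norm(Y.T @ Y, ord="fro"))

# e.g., similarity of a layer's features before vs. after fine-tuning:
# similarity = linear_cka(features_before, features_after)
```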
Mapping the data to the correct one-hot encoded output neuron therefore necessarily comes with a large change in similarity. Interestingly, it seems that almost all task-specific adaptation happens in the last layer, as we have seen with ANIL. However, it might also be the case that the data distributions studied in MAML and ANIL are simply too similar. Results from Oh et al. [2021] could hint at this, as MAML and ANIL do not perform well on cross-domain tasks. In addition, it has been shown that already during training of the meta-optimization, the change in similarity is an order of magnitude smaller in earlier layers than in later ones [Goerttler & Obermayer, 2021]. This can even be observed in classical machine learning, e.g., when applying a convolutional network to MNIST.
Looking at the results of CKA on BOIL, we observe that there is more change in the representation of convolutional layer 4. In earlier layers, however, the change in similarity remains as small as observed in MAML. This raised the question from one of the reviewers of whether the penultimate layer simply replaces the head layer, leaving the head a random transformation. The authors answer this by saying that the penultimate layer of BOIL acts as a non-linear transformation.
Cosine Similarity
In addition to CKA, the authors of BOIL explore the layer-wise alteration of the representations with the cosine similarities of the four convolutional modules. They compare the similarities between samples of the same class (intra-class) and between samples of different classes (inter-class). The results can be seen on the right side.
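The two quantities can be computed from a layer's flattened activations roughly as follows; `features` and `labels` are assumed to come from a task's query set, and the exact aggregation in the paper may differ.

```python
import numpy as np

def intra_inter_cosine(features, labels):
    """Mean cosine similarity between same-class pairs (intra-class) and
    different-class pairs (inter-class) of representations."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sims = f @ f.T                               # pairwise cosine similarities
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(labels), dtype=bool)  # exclude self-similarity
    return sims[same & off_diag].mean(), sims[~same].mean()
```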
We observe that the similarities do not change for MAML and ANIL. In addition, their patterns are similar, and both hint at representation reuse. Their pattern, a monotonic decrease across layers that makes the representations separable, does not change during fine-tuning. We see that the intra-class similarity is higher than the inter-class similarity, hinting at a separated representation. Hence, MAML's and ANIL's effectiveness depends on the meta-initialized body and not on task-specific adaptation.
BOIL's pattern is different: the similarity only decreases until convolutional block three. Also, the intra-class and inter-class similarities are not significantly different before adaptation. Therefore, the representations resulting from the meta-initialized body cannot yet be classified. However, we see that the inter-class cosine similarities decrease after fine-tuning the model. After adaptation, the representations can be classified (and, as we know, even by a head that is not fine-tuned).
Although the similarities of convolutional blocks one to three do not change, the authors justify this with a general peculiarity of the convolutional body [Oh et al., 2021]. As the gradient norms in convolutional modules one to three are also higher than in MAML and ANIL, they conclude that representation reuse is lower there.
Discussion
MAML is a great meta-learning algorithm and very flexible due to its bi-level optimization setup. However, it is sometimes criticized that it does not really learn to learn (a term often used synonymously with meta-learning) but only learns a good average across similar tasks. One could think that (A)NIL even underlines this argument by making the inner update loop (almost) useless. Luckily, BOIL shows that fast adaptation in earlier layers leads to more success, although the MAML algorithm has to be forced to do so, because otherwise it only updates the last layer. Still, we think that the discussion about whether MAML truly learns to learn has more to do with the few-shot learning setup: although the network only sees a few samples per task, it is usually still trained on a large dataset. In the end, it is probably also a question of what your definition of meta-learning and your expectations are. We hope that in the future other, more difficult few-shot tasks will become popular in which the samples of the task distribution are more dissimilar, e.g., the Meta-Dataset [Triantafillou et al., 2020].
Regardless of whether you want to call it meta-learning or learning to learn, the performance of MAML is pretty good, and we think the idea behind MAML is still splendid. However, we think that the question of whether MAML rather reuses representations or changes them cannot simply be answered with the current state of experiments. First of all, simple convolutional networks also show only a small representation change in early layers. Secondly, the change during the meta-learning phase is also smaller there. In addition, although BOIL changes the penultimate layer, it does not change anything significantly before it. Nevertheless, BOIL increases the performance significantly across many different task datasets, which is why we still think its approach is very important for the future application and understanding of MAML, particularly in that it indicates a natural superiority of a high amount of representation change in a convolutional layer, as opposed to the head.