Keywords: meta-learning, multimodal few-shot learning, vision and language models, transformers
TL;DR: We introduce a novel multimodal few-shot meta-learner that leverages large-scale frozen vision and language models.
Abstract: Multimodal few-shot learning is challenging due to the large domain gap between the vision and language modalities. To bridge this gap, we introduce a meta-learning approach for multimodal few-shot learning, leveraging its strong ability to accrue knowledge across tasks. The full model is built on frozen foundation vision and language models, benefiting from their already-learned capacity. To translate the visual features into the latent space of the language model, we introduce a lightweight meta-mapper that acts as a meta-learner. By updating only the parameters of the meta-mapper, our model learns to adapt quickly to unseen samples with only a few gradient-step updates. Unlike prior multimodal few-shot learners, which require hand-engineered task induction, our model induces the task in a fully data-driven manner. Experiments on recent multimodal few-shot benchmarks demonstrate that our meta-learning approach yields better multimodal few-shot learners than its counterparts, while being computationally more efficient.
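The core idea in the abstract — keep both backbones frozen and adapt only a small mapper that projects visual features into the language model's embedding space with a few gradient steps — can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: the dimensions, the tanh "vision encoder", and the squared-error surrogate loss are all hypothetical stand-ins (a real setup would use the frozen LM's language-modeling loss and backpropagation).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, not the paper's actual configuration.
D_VIS, D_LM, N_PREFIX = 8, 6, 4  # visual dim, LM embedding dim, prefix length

def frozen_vision_encoder(image):
    # Stand-in for a frozen vision backbone: any fixed feature extractor.
    return np.tanh(image)

def meta_mapper(feat, W):
    # Lightweight mapper: projects visual features to N_PREFIX soft tokens
    # living in the frozen language model's embedding space.
    return (feat @ W).reshape(N_PREFIX, D_LM)

def task_loss(W, images, targets):
    # Toy surrogate for the frozen LM's loss: squared error between the
    # mean prefix token and a target embedding.
    losses = [np.mean((meta_mapper(frozen_vision_encoder(x), W).mean(0) - y) ** 2)
              for x, y in zip(images, targets)]
    return float(np.mean(losses))

def numeric_grad(W, images, targets, eps=1e-5):
    # Finite-difference gradient; a real implementation would backpropagate
    # through the frozen language model instead.
    g = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        Wp, Wm = W.copy(), W.copy()
        Wp[idx] += eps
        Wm[idx] -= eps
        g[idx] = (task_loss(Wp, images, targets)
                  - task_loss(Wm, images, targets)) / (2 * eps)
    return g

# A few-shot "task": support images with target embeddings (random toys).
support_x = rng.normal(size=(5, D_VIS))
support_y = rng.normal(size=(5, D_LM))

# Only the meta-mapper parameters W are adapted; both backbones stay frozen.
W = rng.normal(scale=0.1, size=(D_VIS, D_LM * N_PREFIX))
before = task_loss(W, support_x, support_y)
for _ in range(5):  # a few gradient-step updates, as in the abstract
    W -= 0.5 * numeric_grad(W, support_x, support_y)
after = task_loss(W, support_x, support_y)
print(after < before)  # adaptation reduces the support loss
```

The design point the sketch captures is the parameter economy: the frozen backbones contribute all representational capacity, while the inner-loop adaptation touches only the small mapper matrix `W`, which is what makes few-step few-shot adaptation cheap.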