- Abstract: State-of-the-art deep learning algorithms generally require large amounts of data for model training; a lack thereof can severely deteriorate performance. To this end, we propose a multi-modal approach that bridges this information gap by means of meaningful joint embeddings. Specifically, we present a benchmark that is multi-modal during training (i.e., images and texts) and single-modal at test time (i.e., images only), with the associated task of utilizing multi-modal data in base classes (with many samples) to learn explicit visual classifiers for novel classes (with few samples). We then propose a framework built upon the idea of cross-modal data hallucination. In this regard, we introduce a discriminative text-conditional GAN for sample generation, together with a simple self-paced strategy for sample selection. Experiments on the proposed benchmark demonstrate that learning generative models in a cross-modal fashion facilitates few-shot learning by compensating for the lack of data in novel categories.
- Keywords: Few-Shot Learning, Multi-Modal, Fine-grained Recognition, Meta-Learning
- TL;DR: We propose a benchmark for few-shot learning in multi-modal scenarios, together with an approach that combines a discriminative text-conditional GAN for cross-modal sample generation with a simple self-paced strategy for sample selection.
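
The self-paced sample selection mentioned above can be illustrated with a minimal sketch. This is a hypothetical toy implementation, not the paper's exact criterion: it assumes each hallucinated sample carries a classifier confidence score, and the selection rule simply keeps samples above a confidence threshold that is relaxed over rounds (easy samples first, harder ones later).

```python
def self_paced_select(samples, scores, threshold):
    """Keep hallucinated samples whose classifier confidence
    meets the current threshold (hypothetical selection rule;
    the paper's exact criterion may differ)."""
    return [s for s, c in zip(samples, scores) if c >= threshold]

# Toy example: generated samples with mock confidence scores.
samples = ["gen_1", "gen_2", "gen_3", "gen_4"]
scores = [0.9, 0.4, 0.7, 0.2]

selected = []
# Self-paced schedule: start strict (only the most confident,
# "easy" samples), then relax the threshold to admit harder
# samples as training progresses.
for threshold in (0.8, 0.6):
    selected = self_paced_select(samples, scores, threshold)
print(selected)  # final round keeps samples with score >= 0.6
```

The schedule here is a plain threshold anneal; in practice the pace could also be tied to the classifier's training loss rather than a fixed sequence of thresholds.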