Abstract: Humans acquire language in a compositional and grounded manner: they can describe their perceptual world using novel compositions of already learned elementary concepts. However, recent research shows that modern neural networks lack this compositional generalization ability. To address this challenge, we propose \textit{MetaVL}, a meta-transfer learning framework that trains transformer-based vision-and-language (V\&L) models with an optimization-based meta-learning method and episodic training. We carefully construct two datasets based on MSCOCO and Flickr30K to specifically target novel compositional concept learning. Our empirical results show that \textit{MetaVL} outperforms baseline models on both datasets. Moreover, \textit{MetaVL} demonstrates higher sample efficiency than supervised learning, especially in the few-shot setting.
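The optimization-based meta-learning with episodic training mentioned in the abstract can be illustrated with a minimal MAML-style sketch. This is a hypothetical toy example on scalar tasks (not the paper's MetaVL implementation): each episode samples tasks, the model adapts to each task's support data with an inner gradient step, and the shared initialization is updated from the query losses (first-order approximation).

```python
# Minimal first-order MAML-style sketch of episodic meta-training on
# scalar toy tasks. Hypothetical illustration, not the paper's code.

def loss(theta, target):
    # Per-task loss: squared distance to the task's target value.
    return (theta - target) ** 2

def grad(theta, target):
    # Analytic gradient of the squared loss above.
    return 2.0 * (theta - target)

def maml_step(theta, tasks, inner_lr=0.1, outer_lr=0.05):
    """One episodic meta-update: adapt to each task's support target with a
    single inner gradient step, then update theta using the gradient of the
    query loss at the adapted parameters (first-order approximation)."""
    meta_grad = 0.0
    for support, query in tasks:
        adapted = theta - inner_lr * grad(theta, support)  # inner loop
        meta_grad += grad(adapted, query)                  # outer gradient
    return theta - outer_lr * meta_grad / len(tasks)

# Episodic training: every episode presents (support, query) target pairs.
theta = 0.0
episodes = [[(1.0, 1.2), (3.0, 2.8)]] * 50
for tasks in episodes:
    theta = maml_step(theta, tasks)
```

With these two toy tasks the meta-parameter converges to an initialization (near 2.0) from which one inner step adapts well to either task, which is the intuition behind learning novel concepts from few examples.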
Paper Type: long