Keywords: Continual Learning, Language Grounding, Language-Image Embeddings, Multimodal Distributional Semantics, Reference Resolution
Abstract: This paper presents CoLLIE: a simple yet effective model for continual learning of how language is grounded in vision. Given a pre-trained multimodal embedding model, where language and images are projected into the same semantic space (in this case CLIP by OpenAI), CoLLIE learns a transformation function that adjusts the language embeddings when needed to accommodate new language use. Unlike traditional few-shot learning, the model does not just learn new classes and labels, but can also generalize to similar language use. We verify the model's performance on two different tasks of continual learning and show that it can efficiently learn and generalize from only a few examples, while interfering only minimally with the model's original zero-shot performance.
One-sentence Summary: We present a new model for continual learning of how language is grounded in vision, through multimodal embeddings.
Supplementary Material: zip
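The core idea from the abstract, adjusting frozen text embeddings within CLIP's shared space, can be sketched roughly as follows. This is a minimal illustration only: the module shape, the gating mechanism, and the cosine-similarity training loss are assumptions for the sketch, not the authors' actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class EmbeddingAdjuster(nn.Module):
    """Hypothetical CoLLIE-style transformation: maps a frozen text embedding
    to an adjusted embedding in the same space. A learned gate keeps embeddings
    that do not involve the new language use close to their original values,
    limiting interference with zero-shot behavior."""

    def __init__(self, dim: int = 512, hidden: int = 256):
        super().__init__()
        self.adjust = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim)
        )
        self.gate = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, text_emb: torch.Tensor) -> torch.Tensor:
        g = self.gate(text_emb)                  # how strongly to adjust (0..1)
        adjusted = text_emb + g * self.adjust(text_emb)
        return F.normalize(adjusted, dim=-1)     # stay on the unit sphere, as CLIP embeddings are


def train_step(model, optimizer, text_emb, image_emb):
    """Illustrative few-shot update: pull the adjusted text embedding toward
    image embeddings that exemplify the new language use."""
    optimizer.zero_grad()
    out = model(text_emb)
    loss = 1.0 - F.cosine_similarity(out, image_emb, dim=-1).mean()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this sketch, only the adjuster is trained while the underlying CLIP encoders stay frozen, which is one plausible way to learn from a handful of examples without overwriting the pre-trained model.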