Emergent Corpus Pretraining Benefits Vision Language Modeling

24 Sept 2023 (modified: 25 Mar 2024) · ICLR 2024 Conference Withdrawn Submission
Keywords: emergent communication, vision language pretraining, corpus transfer
TL;DR: We pre-train a Vision Language Model on a corpus of Emergent Communication tokens and show that this pretraining improves performance on downstream VL tasks such as Visual Referring Expression and Visual Question Answering.
Abstract: Vision Language Pre-trained Models (VL-PTMs) have achieved state-of-the-art results across various tasks, but their effectiveness relies heavily on large-scale multimodal datasets. While VL-PTMs excel when data are abundant, they struggle to remain sample-efficient on tasks with limited data. In this work, we explore the use of Emergent Communication (EC) for knowledge transfer in VL-PTMs. In particular, we pre-train a state-of-the-art Vision Language (VL) model on a corpus of EC tokens generated through a referential game between two artificial agents. Through experiments on three diverse cross-modal matching and reasoning tasks, we demonstrate significant performance improvements: for instance, EC pretraining improves Visual Referring Expression (VRE) accuracy by $108.6\%$ and Visual Entailment (VE) performance by $69.6\%$. We further show that a vision-language model pre-trained from scratch exclusively on EC tokens with a sequence-to-sequence learning objective can be fine-tuned effectively on numerous other vision-language downstream tasks, outperforming baselines without any pretraining and, in some cases, significantly narrowing the performance gap with models pre-trained on natural language. These results highlight the transferability and generalization of EC pretraining across VL tasks and the potential of the multimodal grounding of EC tokens to enhance VL understanding in resource-constrained settings, especially when natural language data are scarce. We discuss implications and propose avenues for future research on the connections between EC and VL for multimodal understanding and effective human-machine communication.
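
To make the referential-game setup described in the abstract concrete, the sketch below illustrates a generic game of this kind for generating an EC corpus: a Sender encodes a target image's features into a short sequence of discrete tokens, and a Receiver must pick the target out of a set of distractors from that message alone. This is a minimal, hypothetical PyTorch sketch, not the submission's implementation; the module names, hyperparameters (VOCAB, MSG_LEN, FEAT, HID), and the Gumbel-softmax relaxation are all illustrative assumptions.

```python
# Hypothetical sketch of a referential game that yields an EC token corpus.
# All names and sizes are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, MSG_LEN, FEAT, HID = 64, 8, 512, 256  # assumed hyperparameters

class Sender(nn.Module):
    """Maps a target image feature vector to a discrete message."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(FEAT, HID)
        self.to_vocab = nn.Linear(HID, VOCAB)

    def forward(self, target_feats, tau=1.0):
        h = torch.tanh(self.proj(target_feats))                    # (B, HID)
        logits = self.to_vocab(h).unsqueeze(1).expand(-1, MSG_LEN, -1)
        # Gumbel-softmax keeps the discrete message differentiable end-to-end.
        return F.gumbel_softmax(logits, tau=tau, hard=True)        # (B, MSG_LEN, VOCAB)

class Receiver(nn.Module):
    """Scores candidate images against the received message."""
    def __init__(self):
        super().__init__()
        self.msg_proj = nn.Linear(VOCAB * MSG_LEN, HID)
        self.img_proj = nn.Linear(FEAT, HID)

    def forward(self, msg, candidate_feats):
        m = self.msg_proj(msg.flatten(1))                           # (B, HID)
        c = self.img_proj(candidate_feats)                          # (B, K, HID)
        return torch.einsum("bh,bkh->bk", m, c)                     # similarity scores

def game_step(sender, receiver, optimizer, images):
    """One training step; images: (B, K, FEAT), target assumed in slot 0."""
    target_idx = torch.zeros(images.size(0), dtype=torch.long)
    msg = sender(images[:, 0])
    scores = receiver(msg, images)
    loss = F.cross_entropy(scores, target_idx)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item(), msg.argmax(-1)  # token ids form the EC corpus

# Example usage with random features (purely illustrative):
# sender, receiver = Sender(), Receiver()
# opt = torch.optim.Adam(list(sender.parameters()) + list(receiver.parameters()), lr=1e-4)
# loss, ec_tokens = game_step(sender, receiver, opt, torch.randn(32, 5, FEAT))
```

In a setup like this, the argmax token ids emitted once the game converges would be collected over the image set to form the EC corpus on which a VL model is then pre-trained with a sequence-to-sequence objective.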
Supplementary Material: pdf
Primary Area: representation learning for computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 8721