Understanding Multimodal Instruction Format for In-context Learning

24 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: transfer learning, meta learning, and lifelong learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Visual instruction tuning, in-context learning, instruction format
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: The field of vision-and-language machine learning has seen a surge of interest in in-context learning, a technique that enables rapid adaptation to new tasks from just a handful of annotated examples. To strengthen the in-context learning capabilities of multimodal large language models (MLLMs), researchers have explored various instruction tuning formats. In this paper, we study which format is most effective for enhancing the in-context learning ability of vision-language models. We propose Unified Multimodal Instruction Tuning (UMIT), a framework for constructing a text-image interleaved instruction dataset by merging diverse visual instruction datasets under a unified multimodal instruction format. To examine the effectiveness of UMIT, we train several models based on OpenFlamingo using the multimodal instruction formats adopted by existing MLLMs. Extensive experiments confirm that UMIT significantly improves in-context learning ability over prior formats on a wide range of vision-language tasks, including the MME Benchmark and SEED-Bench. Furthermore, we conduct a comprehensive study of how different components of multimodal instruction formats affect the in-context learning ability of MLLMs on 3 traditional vision-language tasks. The results indicate that, by incorporating a task definition component, UMIT constrains the model to focus on task-specific information within in-context exemplars, giving it clear advantages over prior formats in zero- and few-shot generalization during both training and testing.
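To make the idea of a text-image interleaved instruction format with a task definition component more concrete, here is a minimal sketch of how such an example might be assembled. The field names, the `<image>` placeholder token, and the template layout are assumptions for illustration only; the paper's actual format is not reproduced here.

```python
# Illustrative sketch only: field names, the <image> placeholder, and the
# template layout are assumed, not taken from the paper.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Exemplar:
    image_path: str   # image shown to the model for this in-context exemplar
    instruction: str  # task instruction phrased for this exemplar
    response: str     # reference answer


def build_interleaved_example(task_definition: str,
                              exemplars: List[Exemplar],
                              query_image: str,
                              query_instruction: str) -> Tuple[str, List[str]]:
    """Assemble one interleaved instruction example.

    Returns the prompt text and the list of image paths; each <image> token
    in the text corresponds positionally to one entry in the image list,
    which a multimodal model (e.g., an OpenFlamingo-style interface) would
    encode and splice in at those positions.
    """
    parts = [f"Task definition: {task_definition}"]
    images = []
    for ex in exemplars:  # few-shot in-context exemplars
        parts.append(f"<image>{ex.instruction}\nAnswer: {ex.response}")
        images.append(ex.image_path)
    # The query follows the exemplars and is left unanswered for the model.
    parts.append(f"<image>{query_instruction}\nAnswer:")
    images.append(query_image)
    return "\n\n".join(parts), images


if __name__ == "__main__":
    demo = [Exemplar("cat.jpg", "What animal is shown?", "A cat."),
            Exemplar("dog.jpg", "What animal is shown?", "A dog.")]
    prompt, image_paths = build_interleaved_example(
        task_definition="Identify the main animal in the image.",
        exemplars=demo,
        query_image="bird.jpg",
        query_instruction="What animal is shown?")
    print(prompt)
    print(image_paths)
```

Merging diverse visual instruction datasets into one training corpus would then amount to mapping each source dataset onto this shared template, with its own task definition string.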
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 8964