Mantis: Interleaved Multi-Image Instruction Tuning

Published: 15 Nov 2024, Last Modified: 15 Nov 2024. Accepted by TMLR. License: CC BY 4.0
Abstract: Large multimodal models (LMMs) have shown strong results on single-image vision-language tasks, but their ability to solve multi-image vision-language tasks remains limited. Existing LMMs such as OpenFlamingo, Emu2, and Idefics gain their multi-image ability through pre-training on hundreds of millions of noisy interleaved image-text examples from the web, which is neither efficient nor effective. In this paper, we aim to build strong multi-image LMMs via instruction tuning with academic-level resources. To this end, we meticulously construct Mantis-Instruct, containing 721K multi-image instruction examples, to train a family of Mantis models. The instruction tuning equips Mantis with multi-image skills such as co-reference, comparison, reasoning, and temporal understanding. We evaluate Mantis on 8 multi-image benchmarks and 6 single-image benchmarks. Mantis-Idefics2 achieves SoTA results on all the multi-image benchmarks and beats the strongest multi-image baseline, Idefics2-8B, by an average of 13 absolute points. Notably, Idefics2-8B was pre-trained on 140M interleaved multi-image examples, 200x more than Mantis-Instruct. We observe that Mantis performs equally well on held-in and held-out benchmarks, demonstrating its generalization ability. We further evaluate Mantis on single-image benchmarks and show that it maintains strong single-image performance on par with CogVLM and Emu2. Our results show that multi-image abilities need not be gained through massive pre-training; instead, they can be acquired through low-cost instruction tuning. The training and evaluation of Mantis have paved the way for future work to improve LMMs' multi-image abilities.
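To make "interleaved multi-image instruction data" concrete, the sketch below shows what a single Mantis-Instruct-style comparison record might look like. This is a minimal illustrative example only: the field names, file names, and the `<image>` placeholder convention are assumptions for illustration, not the released dataset's exact schema.

```python
import json

# Hypothetical record: a multi-image comparison instruction with interleaved
# image placeholders, so the response can co-reference "the first/second image".
example = {
    "id": "comparison-000001",                       # hypothetical identifier
    "images": ["left_chart.png", "right_chart.png"],  # hypothetical image files
    "conversation": [
        {
            "role": "user",
            "content": "Here are two charts: <image> and <image>. "
                       "Which one shows a faster growth rate, and why?",
        },
        {
            "role": "assistant",
            "content": "The second image shows a faster growth rate because ...",
        },
    ],
}

if __name__ == "__main__":
    # Print the record to verify it is well-formed.
    print(json.dumps(example, indent=2))
```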
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: **Changes compared to the last version**
1. We have updated the details of how we curate the different subsets of Mantis-Instruct in Appendix A.1 to help readers understand the additional details behind its curation. We have also open-sourced all the code for constructing each subset, which should help with reproduction. This was requested by reviewers `8Nmm`, `berU`, and `Eh2y`.
2. We added an investigation of how Mantis performs on open-ended tasks that require long-form outputs in Appendix A.2. Specifically, we evaluate Mantis on image and video captioning tasks and showcase its long-form generation ability beyond pure MCQ answering. This was requested by reviewers `berU` and `Eh2y`.
3. We added experiments on how the performance of Mantis changes in long-context scenarios in Appendix A.3. This was requested by reviewer `8Nmm`.
4. We added experiments on how Mantis's performance changes with the number of input frames on video understanding tasks in Appendix A.3. This was requested by reviewer `8Nmm`.
5. We added a detailed comparison with MIMIC-IT, another multimodal instruction-following dataset, by comparing the performance of Otter and Mantis-Flamingo in Appendix A.4. This was requested by reviewer `Eh2y`.
6. We added a discussion of contemporary works such as LLaVA-Next-Interleave and LLaVA-OneVision in Section 4.2 in recognition of their work. This was requested by reviewer `berU`.
7. We discussed the intuition and references behind the three heuristics used to create the datasets in Section 2.3. This was requested by reviewer `8Nmm`. We also clarified that the image denotation method comes from the previous work Co-Instruct, which conducted a detailed comparison of different multi-image denotation formats, from which we simply select the best one. This was requested by reviewer `berU`.
Code: https://github.com/TIGER-AI-Lab/Mantis
Assigned Action Editor: ~Dennis_Wei1
Submission Number: 3179