X-InstructBLIP: A Framework for aligning X-Modal instruction-aware representations to LLMs and Emergent Cross-modal Reasoning

24 Sept 2023 (modified: 25 Mar 2024) · ICLR 2024 Conference Withdrawn Submission
Keywords: multimodal language model, instruction aware representations, multitask, zero-shot, 3d, video, audio, image, language, frozen llm, alignment
TL;DR: A method for projecting cross-modal data into language models through instruction-aware representations.
Abstract: Vision-language pre-training and instruction tuning have demonstrated general-purpose capabilities in 2D visual reasoning tasks by aligning visual encoders with state-of-the-art large language models (LLMs). In this paper, we introduce a simple yet effective multimodal framework built atop a frozen LLM that seamlessly integrates and manages an arbitrary number of modalities. To facilitate general-modality training, we collect high-quality instruction-tuning data in an automatic and scalable manner, comprising 31K QA samples for audio and 250K QA samples for 3D. We further contribute a novel Discriminative Cross-modal Reasoning (DisCRn) evaluation task with 12K audio-video QA samples and 28K image-3D QA samples. Leveraging instruction-aware representations, our model consistently matches or outperforms leading-edge counterparts, achieving state-of-the-art results in 7 zero-shot scenarios across all investigated modalities. Notably, our approach demonstrates joint reasoning abilities on par with models specifically trained on combined-modality datasets, such as video-audio. All associated resources, including code, datasets, and benchmarks, will be released.
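For illustration only, the sketch below shows one plausible way the idea summarized in the abstract could be realized: a per-modality projector whose learnable query tokens cross-attend to frozen encoder features together with the instruction, and are then mapped into the embedding space of a frozen LLM. All module names, dimensions, and layer choices here are assumptions for this sketch, not the authors' released implementation.

```python
# Hypothetical sketch of an "instruction-aware" per-modality projector feeding a frozen LLM.
# Dimensions, layer counts, and the use of nn.TransformerDecoder are illustrative assumptions.
import torch
import torch.nn as nn

class InstructionAwareProjector(nn.Module):
    """Learnable query tokens attend over modality features conditioned on the instruction,
    then are linearly projected to the frozen LLM's token-embedding dimension."""
    def __init__(self, feat_dim=1024, llm_dim=4096, num_queries=32, hidden=768):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, hidden) * 0.02)
        self.feat_proj = nn.Linear(feat_dim, hidden)    # frozen modality-encoder features -> hidden
        self.instr_proj = nn.Linear(llm_dim, hidden)    # instruction token embeddings -> hidden
        layer = nn.TransformerDecoderLayer(d_model=hidden, nhead=8, batch_first=True)
        self.xattn = nn.TransformerDecoder(layer, num_layers=2)
        self.to_llm = nn.Linear(hidden, llm_dim)        # hidden -> frozen LLM embedding space

    def forward(self, modality_feats, instr_embeds):
        # Concatenate instruction context with modality features as the cross-attention memory,
        # so the extracted query tokens are "instruction aware".
        memory = torch.cat([self.instr_proj(instr_embeds),
                            self.feat_proj(modality_feats)], dim=1)
        q = self.queries.unsqueeze(0).repeat(modality_feats.size(0), 1, 1)
        return self.to_llm(self.xattn(q, memory))       # (B, num_queries, llm_dim)

# One projector per modality; its outputs would be prepended to the instruction embeddings
# and passed to a frozen LLM (omitted here), with only the projector being trained.
if __name__ == "__main__":
    proj = InstructionAwareProjector()
    feats = torch.randn(2, 257, 1024)   # e.g. output of a frozen image/audio/3D encoder
    instr = torch.randn(2, 16, 4096)    # instruction tokens embedded in the LLM's space
    print(proj(feats, instr).shape)     # torch.Size([2, 32, 4096])
```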
Primary Area: representation learning for computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 8642