X-InstructBLIP: A Framework for aligning X-Modal instruction-aware representations to LLMs and Emergent Cross-modal Reasoning

24 Sept 2023 (modified: 25 Mar 2024) · ICLR 2024 Conference Withdrawn Submission
Keywords: multimodal language model, instruction aware representations, multitask, zero-shot, 3d, video, audio, image, language, frozen llm, alignment
TL;DR: A method for projecting cross-modal data into language models through instruction-aware representations.
Abstract: Vision-language pre-training and instruction tuning have demonstrated general-purpose capabilities in 2D visual reasoning tasks by aligning visual encoders with state-of-the-art large language models (LLMs). In this paper, we introduce a simple yet effective multimodal framework built atop a frozen LLM that seamlessly integrates and manages an arbitrary number of modalities. To facilitate general-modality training, we collect high-quality instruction-tuning data in an automatic and scalable manner, comprising 31K QA samples for audio and 250K QA samples for 3D. We further contribute a novel Discriminative Cross-modal Reasoning (DisCRn) evaluation task with 12K audio-video QA samples and 28K image-3D QA samples. Leveraging instruction-aware representations, our model consistently matches or outperforms leading-edge counterparts, achieving state-of-the-art results in 7 zero-shot scenarios across all investigated modalities. Notably, our approach demonstrates joint reasoning abilities on par with models specifically trained on combined-modality datasets, such as video-audio. All associated resources, including code, datasets, and benchmarks, will be released.
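For illustration only, the sketch below shows one plausible way the idea summarized in the abstract could be realized: a per-modality projector whose learnable query tokens cross-attend to frozen encoder features together with the instruction, and are then mapped into the embedding space of a frozen LLM. All module names, dimensions, and layer choices here are assumptions for this sketch, not the authors' released implementation.

```python
# Hypothetical sketch of an "instruction-aware" per-modality projector feeding a frozen LLM.
# Dimensions, layer counts, and the use of nn.TransformerDecoder are illustrative assumptions.
import torch
import torch.nn as nn

class InstructionAwareProjector(nn.Module):
    """Learnable query tokens attend over modality features conditioned on the instruction,
    then are linearly projected to the frozen LLM's token-embedding dimension."""
    def __init__(self, feat_dim=1024, llm_dim=4096, num_queries=32, hidden=768):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, hidden) * 0.02)
        self.feat_proj = nn.Linear(feat_dim, hidden)    # frozen modality-encoder features -> hidden
        self.instr_proj = nn.Linear(llm_dim, hidden)    # instruction token embeddings -> hidden
        layer = nn.TransformerDecoderLayer(d_model=hidden, nhead=8, batch_first=True)
        self.xattn = nn.TransformerDecoder(layer, num_layers=2)
        self.to_llm = nn.Linear(hidden, llm_dim)        # hidden -> frozen LLM embedding space

    def forward(self, modality_feats, instr_embeds):
        # Concatenate instruction context with modality features as the cross-attention memory,
        # so the extracted query tokens are "instruction aware".
        memory = torch.cat([self.instr_proj(instr_embeds),
                            self.feat_proj(modality_feats)], dim=1)
        q = self.queries.unsqueeze(0).repeat(modality_feats.size(0), 1, 1)
        return self.to_llm(self.xattn(q, memory))       # (B, num_queries, llm_dim)

# One projector per modality; its outputs would be prepended to the instruction embeddings
# and passed to a frozen LLM (omitted here), with only the projector being trained.
if __name__ == "__main__":
    proj = InstructionAwareProjector()
    feats = torch.randn(2, 257, 1024)   # e.g. output of a frozen image/audio/3D encoder
    instr = torch.randn(2, 16, 4096)    # instruction tokens embedded in the LLM's space
    print(proj(feats, instr).shape)     # torch.Size([2, 32, 4096])
```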
Primary Area: representation learning for computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 8642