Abstract: To achieve a deeper understanding of the world, AI must be able to reason across multiple modalities, such as images, audio, video, and 3D. While recent efforts have extended multimodal models to handle additional modalities, there is little evidence that they can reason over more than two modalities simultaneously. This limitation arises partly from the challenge of constructing tasks that require reasoning across multiple modalities. To address this, we introduce Contra4, a dataset designed to train and evaluate contrastive cross-modal reasoning over up to four modalities (audio, video, image, and 3D) simultaneously. Our approach unifies modalities through human-annotated captions and generates contrastive question-answer pairs, filtered via a mixture-of-models round-trip-consistency check. Human inspection validates the high quality of Contra4, with 83.3% perceived correctness, while fine-tuning on the task yields a 56% relative accuracy improvement. Benchmarking state-of-the-art models on a human-annotated subset of 2.3k samples underscores the dataset's difficulty, with the best-performing model achieving only 56% accuracy on the full dataset and just 42% in four-modality settings.
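The abstract describes filtering generated question-answer pairs with a mixture-of-models round-trip-consistency check. The sketch below shows one way such a filter could be structured, assuming each candidate sample pairs a contrastive question with the captions of its candidate modality instances and is kept only when enough independent checker models recover the intended answer. All names here (Sample, round_trip_consistent, filter_dataset, the checker callables) are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical checker model interface: maps (question, options) -> chosen option index.
AnswerFn = Callable[[str, List[str]], int]


@dataclass
class Sample:
    question: str        # contrastive question posed over several captions
    options: List[str]   # one caption per candidate modality instance
    answer_idx: int      # index of the intended correct option


def round_trip_consistent(sample: Sample, checkers: List[AnswerFn], min_agree: int) -> bool:
    """Keep a sample only if enough checker models recover the intended answer."""
    votes = sum(
        fn(sample.question, sample.options) == sample.answer_idx
        for fn in checkers
    )
    return votes >= min_agree


def filter_dataset(
    samples: List[Sample],
    checkers: List[AnswerFn],
    min_agree: int = 2,
) -> List[Sample]:
    """Retain only samples that pass the round-trip-consistency vote."""
    return [s for s in samples if round_trip_consistent(s, checkers, min_agree)]
```

In practice the checker callables would wrap prompts to several different language models, so a sample survives only when the question is answerable from the captions alone by a majority of them.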
Paper Type: Short
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: cross-modal content generation, vision question answering, cross-modal application, video processing, multimodality
Contribution Types: Data resources
Languages Studied: English
Submission Number: 2770