Abstract: To achieve a deeper understanding of the world, AI must be able to reason across multiple modalities, such as images, audio, video, and 3D. While recent efforts have extended multimodal models to handle additional modalities, there is little evidence that they can reason over more than two modalities simultaneously. This limitation arises partly from the challenge of constructing tasks that require reasoning across multiple modalities. To address this, we introduce Contra4, a dataset designed to train and evaluate contrastive cross-modal reasoning over up to four modalities (audio, video, image, and 3D) simultaneously. Our approach unifies modalities through human-annotated captions and generates contrastive question-answer pairs, filtered via a mixture-of-models round-trip-consistency check. Human inspection validates the high quality of Contra4, with 83.3% perceived correctness, while fine-tuning on the task yields a 56% relative accuracy improvement. Benchmarking state-of-the-art models on a human-annotated subset of 2.3k samples underscores the dataset's difficulty, with the best-performing model achieving only 56% accuracy on the full dataset and just 42% in four-modality settings.
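The abstract describes filtering generated question-answer pairs with a mixture-of-models round-trip-consistency check. The sketch below shows one way such a filter could be structured, assuming each candidate sample pairs a contrastive question with the captions of its candidate modality instances and is kept only when enough independent checker models recover the intended answer. All names here (Sample, round_trip_consistent, filter_dataset, the checker callables) are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical checker model interface: maps (question, options) -> chosen option index.
AnswerFn = Callable[[str, List[str]], int]


@dataclass
class Sample:
    question: str        # contrastive question posed over several captions
    options: List[str]   # one caption per candidate modality instance
    answer_idx: int      # index of the intended correct option


def round_trip_consistent(sample: Sample, checkers: List[AnswerFn], min_agree: int) -> bool:
    """Keep a sample only if enough checker models recover the intended answer."""
    votes = sum(
        fn(sample.question, sample.options) == sample.answer_idx
        for fn in checkers
    )
    return votes >= min_agree


def filter_dataset(
    samples: List[Sample],
    checkers: List[AnswerFn],
    min_agree: int = 2,
) -> List[Sample]:
    """Retain only samples that pass the round-trip-consistency vote."""
    return [s for s in samples if round_trip_consistent(s, checkers, min_agree)]
```

In practice the checker callables would wrap prompts to several different language models, so a sample survives only when the question is answerable from the captions alone by a majority of them.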
Paper Type: Short
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: cross-modal content generation, vision question answering, cross-modal application, video processing, multimodality
Contribution Types: Data resources
Languages Studied: English
Submission Number: 2770