Keywords: Vision-Language Models, Cross-Modal Multi-Hop Reasoning, Interleaved Image-Text, Synthetic Data Generation, Benchmark Construction
Abstract: Real-world reasoning often requires combining information across modalities; for example, following a recipe involves connecting textual instructions with visual cues in a multi-hop process. Yet most multimodal benchmarks fail to capture this ability: they typically rely on single images or static image sets, where answers can be inferred from one modality alone. This limitation is mirrored in training data, where interleaved image–text content rarely enforces complementary, multi-hop reasoning. As a result, Vision-Language Models (VLMs) frequently hallucinate and produce reasoning traces that are poorly grounded in visual evidence. To address this gap, we introduce CRUX, a new dataset and benchmark built with a scalable automatic pipeline for generating complex cross-modal reasoning tasks. CRUX spans natural images, videos, and text-rich sources, and includes a manually verified test set for reliable evaluation. Models trained on CRUX show significant gains in grounded multi-hop reasoning, including strong improvements on SPIQA and other multi-image benchmarks.
Primary Area: datasets and benchmarks
Submission Number: 9117