ROVER: Benchmarking Reciprocal Cross-Modal Reasoning for Omnimodal Generation

Published: 26 Jan 2026 · Last Modified: 11 Feb 2026 · ICLR 2026 Poster · CC BY 4.0
Keywords: Unified Multimodal Model, Generation Benchmark, Cross-modal Reasoning
TL;DR: A benchmark for evaluating cross-modal reasoning in unified multimodal models
Abstract: Unified multimodal models (UMMs) have shown remarkable advances in jointly understanding and generating text and images. However, prevailing evaluations treat these abilities in isolation, so that tasks with multimodal inputs and outputs are scored primarily through unimodal reasoning: textual benchmarks emphasize language-based reasoning, while visual benchmarks emphasize reasoning outcomes manifested in the pixels. As a result, existing benchmarks rarely require one modality to guide, verify, or refine outputs in the other, and thus fail to capture a central aspiration of unified multimodal models: seamless reasoning across modalities. We address this gap with **ROVER**, a human-annotated benchmark that explicitly targets reciprocal cross-modal reasoning. It contains 1,285 tasks grounded in 2,048 images, spanning two complementary settings. **Verbally-augmented reasoning for visual generation** evaluates whether models can use structured verbal prompts and reasoning chains to guide faithful image synthesis. **Visually-augmented reasoning for verbal generation** evaluates whether models can generate intermediate visualizations that strengthen their own reasoning processes. Experiments on 17 state-of-the-art UMMs reveal two key findings: (i) cross-modal reasoning capability strongly correlates with visual generation performance, particularly for interleaved image–text generation; and (ii) current models remain severely limited in visually-augmented reasoning, showing relative strength in perception and physical modeling but weakness in logical tasks. These results highlight reciprocal cross-modal reasoning as a critical frontier for enabling true omnimodal generation. More information on **Anonymous Page**: https://anony0923.github.io
Primary Area: datasets and benchmarks
Submission Number: 1795