Do Vision and Text Cues Exhibit Evidential Coupling? UFO: A Benchmark for Compositional Multimodal Reasoning in Unified Models

Published: 30 Apr 2026, Last Modified: 24 Jun 2026, ICML 2026 regular, CC BY 4.0
TL;DR: We contribute a challenging multimodal reasoning benchmark for unified foundation models.
Abstract: Unified Foundation Models (UFMs), which support interleaved multimodal generation and understanding, have been proposed as a promising paradigm for reasoning about dynamic world states, yet it remains unclear whether the visual content they generate functions as grounded evidence for subsequent reasoning or merely as auxiliary output. Existing benchmarks largely evaluate generation and understanding as separate capabilities and do not test their functional dependence during reasoning. We introduce UFO, a benchmark designed to evaluate whether UFMs generate and use image and text cues as evidence for compositional multimodal reasoning. UFO spans three state-transition regimes (state determination, state reconstruction, and state augmentation), which correspond to progressively smaller transformations of the underlying world state. Our analysis reveals a significant modality gap: models often achieve high prediction accuracy even when the generated visual cues exert limited influence on their decisions, indicating weakened evidential coupling and a reliance on textual shortcuts rather than robust cross-modal grounding.
Lay Summary: Today's most advanced AI can do two things at once: understand images and words, and create new ones of its own. Many hope that letting such an AI "draw" what might happen next will help it reason, like a person sketching a diagram to solve a problem. But it has been unclear whether these systems truly use the pictures they create, or whether they are just for show while the answer comes from word-based shortcuts. To find out, we built UFO, a test of nearly 4,000 questions. For each one, the AI must first produce its own evidence — a written note and a drawing of the situation — and then answer using them. This lets us measure whether the answer genuinely depends on what the AI drew and wrote, rather than on a lucky guess. Testing twelve leading systems, we found they often answer correctly even when their drawings barely affect the result. The pictures and words frequently fail to back each other up, and the systems fall back on word-based shortcuts instead of combining what they "see" and "read." UFO gives researchers a clear way to spot this weakness and build AI whose reasoning is easier to check and trust.
Primary Area: Deep Learning->Large Language Models
Keywords: Benchmark, Multimodal, Reasoning, Unified Foundation Models
Submission Number: 14632