UEval: A Real-World Benchmark for Unified Multimodal Generation

ICLR 2026 Conference Submission 1089 Authors

02 Sept 2025 (modified: 27 Nov 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: multimodal eval; unified models
TL;DR: We introduce UEval, a challenging real-world benchmark for multimodal generation with unified models.
Abstract: We introduce UEval, a challenging real-world benchmark for multimodal generation with unified models, i.e., models capable of generating both images and text. UEval comprises 1,000 expert-curated prompt requests, sourced from 8 diverse real-world domains, that require both images and text in the model output. The curated questions cover a wide range of reasoning types, from step-by-step guides to textbook explanations. Evaluating open-ended multimodal generation is non-trivial, as simple LLM-as-a-judge methods can miss subtleties. To address this, we design a rubric-based scoring system in UEval: reference images and text are provided as input, an LLM generates an initial rubric for each question, and human experts refine it to ensure reliability. This question-specific rubric design allows for a more tailored and accurate assessment. UEval is designed to be highly challenging: GPT-5-Thinking scores only 61.7 out of 100, while the best open-source model reaches merely 27.1. We observe that reasoning models consistently outperform non-reasoning ones, and that transferring reasoning traces from a reasoning model to a non-reasoning model significantly narrows the gap. This suggests that "reasoning" may be essential for requests requiring complex multimodal understanding and generation. The dataset, code, and results will be publicly released.
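
To make the rubric-based scoring flow concrete, the following is a minimal Python sketch of one way such a pipeline could be wired up. It is not the authors' implementation: the data classes, prompt wording, and the `llm` callable (any function mapping a prompt string to a completion string) are assumptions for illustration only, and the expert refinement step is left out.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class RubricItem:
    # One question-specific criterion, e.g. "the diagram labels every step".
    description: str
    weight: float = 1.0  # relative importance of this criterion

@dataclass
class Question:
    prompt: str                   # the expert-curated request
    reference_text: str           # reference answer text
    reference_images: List[str]   # paths/URLs of reference images
    rubric: List[RubricItem] = field(default_factory=list)

def draft_rubric(q: Question, llm: Callable[[str], str]) -> List[RubricItem]:
    """Ask a judge LLM to propose an initial rubric from the references.
    Human experts would then refine these items by hand (not shown)."""
    raw = llm(
        "Given this request and its reference answer, list the criteria "
        "a good response must satisfy, one per line.\n\n"
        f"Request: {q.prompt}\nReference: {q.reference_text}"
    )
    return [RubricItem(line.strip()) for line in raw.splitlines() if line.strip()]

def score_response(q: Question, response: str, llm: Callable[[str], str]) -> float:
    """Check each rubric item with the judge LLM and return a 0-100 score."""
    total = sum(item.weight for item in q.rubric) or 1.0
    earned = 0.0
    for item in q.rubric:
        verdict = llm(
            f"Criterion: {item.description}\n"
            f"Response: {response}\n"
            "Answer yes or no: is the criterion satisfied?"
        )
        if verdict.strip().lower().startswith("yes"):
            earned += item.weight
    return 100.0 * earned / total
```
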
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 1089