Keywords: Evaluation, Unified Multimodal Model, Visual Generation
Abstract: Unified multimodal models integrate the reasoning capacity of large language models with both image understanding and generation, showing great promise for advanced multimodal intelligence. However, the community still lacks a rigorous reasoning-centric benchmark to systematically evaluate the alignment between understanding and generation, and their generalization potential in complex visual tasks.
To this end, we introduce \textbf{GIR-Bench}, a comprehensive benchmark that evaluates unified models across three complementary perspectives.
Firstly, we explore whether models can consistently leverage the same knowledge for both understanding and generation (GIR-Bench-Uni).
Secondly, we investigate whether models can perform reasoning-centric text-to-image generation that requires applying logical constraints and implicit knowledge to generate faithful visual content (GIR-Bench-T2I).
Thirdly, we evaluate whether models can handle multi-step reasoning in image editing (GIR-Bench-Edit).
For each subset, we carefully design a task-specific evaluation pipeline. This enables fine-grained and interpretable evaluation while mitigating the biases of the prevalent MLLM-as-a-Judge paradigm.
Extensive evaluations of various unified models and generation-only systems show that, although unified models are more capable on reasoning-driven visual tasks, they still exhibit a persistent gap between understanding and generation. The data and code for GIR-Bench are available at
\url{https://anonymous.4open.science/r/GIR-Bench-7E40}.
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 1588