Keywords: Text-Image Alignment, Reinforcement Learning, Visual Reasoning
Abstract: Evaluating the alignment between textual prompts and generated images is critical for ensuring the reliability and usability of text-to-image (T2I) models. However, most existing evaluation methods rely on coarse-grained metrics or static Question Answering (QA) pipelines, which lack fine-grained interpretability and struggle to reflect human preferences. To address this, we propose $\textbf{REVEALER}$, a reinforcement-guided visual reasoning framework for element-level text-to-image alignment evaluation.
Adopting a structured "grounding–reasoning–conclusion" paradigm, our method enables Multimodal Large Language Models (MLLMs) to explicitly localize semantic elements and derive interpretable alignment judgments. We optimize the model via Group Relative Policy Optimization (GRPO) using a multi-dimensional reward function that targets format compliance, localization precision, and alignment accuracy.
Extensive experiments confirm that REVEALER achieves state-of-the-art results across four benchmarks. Notably, on EvalMuse-40K, it surpasses the strong proprietary Gemini 3 Pro and the best training-based baselines with absolute accuracy gains of $\textbf{+4.0\%}$ and $\textbf{+13.1\%}$, respectively.
Ablation studies further validate each component of our design, which together contribute a cumulative $\textbf{19.4\%}$ accuracy improvement over the base model.
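As a concrete illustration of the abstract's multi-dimensional reward, the sketch below combines the three stated terms (format compliance, localization precision, and alignment accuracy) into a scalar reward of the kind GRPO could optimize. This is a minimal sketch, not the paper's implementation: the weights, the IoU-based localization term, and the helper names are all assumptions for illustration.

```python
# Illustrative sketch (assumed, not the paper's code): a scalar reward
# built from the three terms named in the abstract. Weights w_fmt, w_loc,
# and w_acc are hypothetical choices.

def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def reward(format_ok, pred_box, gt_box, pred_label, gt_label,
           w_fmt=0.2, w_loc=0.4, w_acc=0.4):
    """Weighted sum of format compliance, localization IoU, and
    element-level alignment accuracy (all in [0, 1])."""
    r_fmt = 1.0 if format_ok else 0.0          # structured output parsed?
    r_loc = iou(pred_box, gt_box)              # grounding precision
    r_acc = 1.0 if pred_label == gt_label else 0.0  # alignment judgment
    return w_fmt * r_fmt + w_loc * r_loc + w_acc * r_acc
```

In a GRPO setup, a reward of this shape would be computed per sampled response and normalized within each group to form the advantage; the per-term decomposition is what lets the reward separately encourage well-formed reasoning traces, accurate grounding, and correct final judgments.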
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: Resources and Evaluation, Generation, Interpretability and Analysis of Models for NLP
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Publicly available software and/or pre-trained models, Data resources
Languages Studied: English
Submission Number: 8535