FocusDiff: Advancing Fine-Grained Text-Image Alignment for Autoregressive Visual Generation through RL

14 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: text-to-image generation
Abstract: Recent studies extend the autoregressive paradigm to text-to-image generation, achieving performance comparable to diffusion models. However, our new PairComp benchmark -- featuring test cases of paired prompts with similar syntax but different fine-grained semantics -- reveals that existing models struggle with fine-grained text-image alignment and thus fail to realize precise control over visual tokens. To address this, we propose FocusDiff, which enhances fine-grained text-image semantic alignment by focusing on subtle differences between similar text-image pairs. We construct a new dataset of paired texts and images with similar overall expressions but distinct local semantics, and further introduce an improved GRPO-based algorithm that emphasizes such fine-grained semantic differences for desired image generation. Our approach achieves superior performance on existing text-to-image benchmarks and also outperforms prominent prior methods on PairComp. Anonymous Project: https://anonymous.4open.science/r/FocusDiff_Anonym-1F44.
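The abstract mentions an improved GRPO-based algorithm built around paired prompts but does not specify its details. The following is a minimal, hypothetical sketch of how group-relative advantages could be computed for a paired-prompt batch; the function names, reward interface, and joint-normalization scheme are assumptions for illustration only, not the paper's actual method.

```python
# Hypothetical sketch of GRPO-style advantages for paired text-to-image prompts.
# Everything here (function names, the pairing scheme, joint normalization) is
# an illustrative assumption; the paper's "improved GRPO-based algorithm" is
# not described in this abstract.
import numpy as np


def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Standard GRPO advantage: normalize each sample's reward against the
    mean/std of its group (all images generated for the same prompt)."""
    mean = rewards.mean(axis=-1, keepdims=True)
    std = rewards.std(axis=-1, keepdims=True)
    return (rewards - mean) / (std + eps)


def paired_prompt_advantages(rewards_a: np.ndarray, rewards_b: np.ndarray,
                             eps: float = 1e-6) -> tuple[np.ndarray, np.ndarray]:
    """Assumed paired variant: normalize each prompt's rewards over the joint
    group of both prompts in a pair, so samples that blur the two fine-grained
    meanings receive lower advantages than samples that distinguish them."""
    joint = np.concatenate([rewards_a, rewards_b], axis=-1)
    mean = joint.mean(axis=-1, keepdims=True)
    std = joint.std(axis=-1, keepdims=True)
    return (rewards_a - mean) / (std + eps), (rewards_b - mean) / (std + eps)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Rewards for G=4 generated images per prompt, for a batch of 2 prompt pairs.
    r_a = rng.uniform(0, 1, size=(2, 4))  # rewards for prompt A of each pair
    r_b = rng.uniform(0, 1, size=(2, 4))  # rewards for prompt B of each pair
    adv_a, adv_b = paired_prompt_advantages(r_a, r_b)
    print(adv_a.shape, adv_b.shape)  # (2, 4) (2, 4)
```

In this sketch, normalizing over the joint group ties the two similar prompts together, which is one plausible way to make the policy gradient sensitive to the fine-grained semantic difference between them.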
Supplementary Material: zip
Primary Area: generative models
Submission Number: 5219