R-Bind: Unified Enhancement of Attribute and Relation Binding in Text-to-Image Diffusion Models

ACL ARR 2025 May Submission 1737 Authors

18 May 2025 (modified: 03 Jul 2025) · CC BY 4.0
Abstract: Text-to-image models frequently fail to align with textual prompts, particularly in maintaining proper semantic binding between the elements of a prompt. Existing approaches typically require costly retraining, or focus only on correctly generating the attributes of entities (entity-attribute binding) while ignoring the equally crucial task of correctly generating the relations between entities (entity-relation-entity binding), which leads to unsatisfactory binding performance. In this work, we propose R-Bind, a novel training-free method that simultaneously improves both entity-attribute and entity-relation-entity binding. Our method introduces three inference-time optimization losses that adjust attention maps during generation. Comprehensive evaluations across multiple datasets demonstrate our approach's effectiveness, validity, and flexibility in enhancing semantic binding without additional training.
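The abstract does not spell out the three losses, but the general mechanism it names (training-free, inference-time optimization of cross-attention maps) can be sketched generically. The snippet below is a hypothetical illustration, not the paper's actual formulation: the overlap objective, the function names binding_loss and update_latents, the step size, and the token indices are all assumptions made for the sake of a runnable example.

```python
import torch

def binding_loss(attn, entity_idx, attr_idx):
    """Illustrative entity-attribute binding loss.
    attn: cross-attention maps of shape (num_pixels, num_tokens), where each
    column is one text token's spatial attention distribution."""
    entity_map = attn[:, entity_idx]
    attr_map = attn[:, attr_idx]
    # Normalize each map to a probability distribution over pixels.
    entity_map = entity_map / (entity_map.sum() + 1e-8)
    attr_map = attr_map / (attr_map.sum() + 1e-8)
    # Histogram intersection is 1 when the maps coincide, so 1 - overlap
    # penalizes an attribute token that attends away from its entity.
    return 1.0 - torch.minimum(entity_map, attr_map).sum()

def update_latents(latents, attn, pairs, step_size=0.1):
    """One inference-time optimization step: backpropagate the binding loss
    through the attention maps to the current latents and nudge the latents
    before the next denoising step. No model weights are updated."""
    loss = sum(binding_loss(attn, e, a) for e, a in pairs)
    grad = torch.autograd.grad(loss, latents)[0]
    return (latents - step_size * grad).detach()

# Toy usage. In a real pipeline `attn` would come from the UNet's
# cross-attention layers and depend on `latents` through the diffusion
# forward pass; here a random linear map stands in so the example runs.
latents = torch.randn(1, 4, 64, 64, requires_grad=True)
to_tokens = torch.nn.Linear(4, 8)            # stand-in for cross-attention
scores = to_tokens(latents.permute(0, 2, 3, 1)).reshape(-1, 8)
attn = torch.softmax(scores, dim=0)          # per-token spatial maps
latents = update_latents(latents, attn, pairs=[(2, 3)])  # indices illustrative
```

In practice such a gradient step would be interleaved with the sampler's denoising iterations, with the latents re-marked as requiring gradients at each step.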
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: cross-modal content generation, multimodality
Contribution Types: Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 1737