Look Locally, Learn Precisely: Interpretable and Unbiased Text-to-Image Generation with Background Fidelity
Keywords: Unbiased Image Generation, Diffusion Models, Generative AI
Abstract: Text-to-image diffusion models have achieved remarkable progress, yet they still struggle to produce unbiased and responsible outputs. A promising direction is to manipulate the bottleneck space of the U-Net (the $h$-space), which provides interpretability and controllability. However, existing methods learn attributes from the entire image, entangling them with spurious features, and offer no corrective mechanism at inference. This uniform reliance on global features leads to poor subject alignment, fairness issues, reduced photorealism, and incoherent backgrounds for scene-specific prompts. To address these challenges, we propose two complementary innovations, one for training and one for inference. First, we introduce a spatially focused concept learning framework that disentangles target attributes into concept vectors through three novel mechanisms: (i) suppressing target-attribute features within the multi-head cross-attention (MCA) modules and (ii) attenuating the encoder output (i.e., the $h$-vector), which together ensure that the concept vector exclusively captures target-attribute features, and (iii) a spatially weighted reconstruction loss that emphasizes regions relevant to the target attribute. Second, we design an inference-time strategy that improves background consistency by enhancing low-frequency components in the $h$-space. Experiments demonstrate that our approach improves fairness, subject fidelity, and background coherence while preserving visual quality and prompt alignment, outperforming state-of-the-art $h$-space methods. The code is included in the supplementary material.
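To make the two ideas in the abstract more concrete, the sketch below shows one plausible way to (i) weight a reconstruction loss by an attribute-relevance mask and (ii) amplify low-frequency components of an $h$-space feature map at inference time. This is a minimal illustration only: all names (`spatially_weighted_mse`, `enhance_low_frequency`, `weight_map`, `cutoff_ratio`, `boost`) are assumptions introduced here, not the paper's actual implementation, which is provided in the supplementary material.

```python
# Illustrative sketch, not the authors' code: a spatially weighted reconstruction
# loss and a simple FFT-based low-frequency boost for a U-Net bottleneck tensor.
import torch
import torch.fft


def spatially_weighted_mse(pred: torch.Tensor, target: torch.Tensor,
                           weight_map: torch.Tensor) -> torch.Tensor:
    # weight_map (B, 1, H, W) is assumed to highlight regions relevant to the
    # target attribute (e.g., a segmentation or attention mask); higher weights
    # make reconstruction errors in those regions count more.
    return (weight_map * (pred - target) ** 2).mean()


def enhance_low_frequency(h: torch.Tensor, cutoff_ratio: float = 0.25,
                          boost: float = 1.2) -> torch.Tensor:
    """Amplify low-frequency spatial components of an h-space tensor (B, C, H, W)."""
    # 2-D FFT over the spatial dimensions, with the zero frequency shifted to the center.
    spectrum = torch.fft.fftshift(torch.fft.fft2(h, dim=(-2, -1)), dim=(-2, -1))

    _, _, H, W = h.shape
    # Centered circular low-pass mask over the frequency grid.
    ys = torch.arange(H, device=h.device, dtype=torch.float32) - H / 2
    xs = torch.arange(W, device=h.device, dtype=torch.float32) - W / 2
    yy, xx = torch.meshgrid(ys, xs, indexing="ij")
    radius = cutoff_ratio * min(H, W) / 2
    low_pass = ((yy ** 2 + xx ** 2).sqrt() <= radius).float()

    # Scale low frequencies by `boost`; leave high frequencies unchanged.
    gain = 1.0 + (boost - 1.0) * low_pass
    spectrum = spectrum * gain

    # Back to the spatial domain; the imaginary residue is numerically negligible.
    out = torch.fft.ifft2(torch.fft.ifftshift(spectrum, dim=(-2, -1)), dim=(-2, -1))
    return out.real


# Usage example with a Stable-Diffusion-sized bottleneck activation (hypothetical shapes).
h = torch.randn(1, 1280, 8, 8)
h_enhanced = enhance_low_frequency(h, cutoff_ratio=0.25, boost=1.2)
```

The frequency-domain gain is only one possible realization of "enhancing low-frequency components"; a spatial-domain alternative (e.g., blending with a Gaussian-blurred copy of $h$) would serve the same illustrative purpose.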
Supplementary Material: zip
Primary Area: generative models
Submission Number: 6148