EviDiff: Learning Object-wise Consistency for Text-to-Image Diffusion

15 Sept 2025 (modified: 12 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Text-to-image composition, Diffusion model
TL;DR: An evidential learning driven T2I diffusion model for efficient multi-object composition.
Abstract: The consistency constraint between text prompts and image content is pivotal in text-to-image (T2I) diffusion models for composing multiple object categories. However, this constraint is often underemphasized in the denoising process of diffusion models. Although token-supervised diffusion models can mitigate this issue by learning object-wise consistency between the image content and object segmentation maps, they tend to suffer from segmentation map bias and semantic overlap conflicts, especially when multiple objects are involved. To address this, we propose EviDiff, a new evidential-learning-supervised T2I diffusion model that leverages uncertainty estimation and conflict detection to improve fault tolerance to unreliable segmentation maps and suppress semantic conflicts, thereby strengthening object-wise consistency learning. Specifically, a pixel evidence loss is proposed to restrain overconfidence in unreliable labels through evidential regularization, and a token conflict loss is designed to weaken contradictions between semantics by optimizing a measured conflict factor. Extensive experiments show that EviDiff outperforms state-of-the-art T2I diffusion models in multi-object compositional generation without requiring additional inference-time manipulations. Notably, EviDiff can be seamlessly integrated into the existing training pipeline of T2I diffusion models. The code and the trained EviDiff model are available at https://github.com/anonymity-coder/EviDiff.
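The sketch below illustrates how a pixel-level evidential regularization term of the kind described in the abstract might look, using the standard evidential deep learning formulation (per-pixel Dirichlet evidence with a KL penalty toward the uniform prior, as in Sensoy et al., 2018). It is an assumed, generic implementation for illustration only, not the paper's actual pixel evidence loss; the function names and the weight `lam` are hypothetical.

```python
import torch
import torch.nn.functional as F

def kl_uniform_dirichlet(alpha):
    """KL(Dir(alpha) || Dir(1,...,1)): penalizes evidence that the
    (possibly unreliable) label does not support."""
    K = alpha.shape[1]
    S = alpha.sum(dim=1, keepdim=True)
    kl = (torch.lgamma(S).squeeze(1)
          - torch.lgamma(alpha).sum(dim=1)
          - torch.lgamma(torch.tensor(float(K), device=alpha.device))
          + ((alpha - 1.0) * (torch.digamma(alpha) - torch.digamma(S))).sum(dim=1))
    return kl

def evidential_pixel_loss(logits, target, lam=0.1):
    """logits: (B, K, H, W) per-pixel class logits,
    target: (B, H, W) integer segmentation map (possibly noisy).
    Fits the labels while discouraging overconfident evidence."""
    evidence = F.softplus(logits)            # non-negative evidence per class
    alpha = evidence + 1.0                   # Dirichlet concentration parameters
    S = alpha.sum(dim=1, keepdim=True)       # Dirichlet strength
    prob = alpha / S                         # expected class probabilities
    onehot = F.one_hot(target, logits.shape[1]).permute(0, 3, 1, 2).float()
    # squared-error data term of standard evidential deep learning
    data = ((onehot - prob) ** 2 + prob * (1 - prob) / (S + 1)).sum(dim=1)
    # keep the evidence of the labelled class, regularize only the rest
    alpha_tilde = onehot + (1 - onehot) * alpha
    reg = kl_uniform_dirichlet(alpha_tilde)
    return (data + lam * reg).mean()
```

Under this formulation, pixels whose labels disagree with the predicted evidence contribute a KL term that pulls the misleading evidence toward the uniform Dirichlet, which is one common way to limit overconfidence on unreliable segmentation maps.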
Supplementary Material: zip
Primary Area: generative models
Submission Number: 6280