Keywords: Composition, classifier-free guidance, diffusion, text2image
Abstract: We propose to improve multi-concept prompt fidelity in text-to-image diffusion
models. We begin with common failure cases: prompts like “a cat and a clock”
that sometimes yield images where one concept is missing, faint, or colliding
awkwardly with another. We hypothesize that this happens when the diffusion
model drifts into mixed modes that over-emphasize a single concept it learned
strongly during training. Instead of re-training, we introduce a corrective sampling
strategy that steers away from regions where the joint prompt behavior overlaps
too strongly with any single concept in the prompt. The goal is to steer towards
“pure” joint modes where all concepts can coexist with balanced visual presence.
We further show that existing multi-concept guidance schemes can operate in unstable
weight regimes that amplify imbalance; we characterize favorable regions
and adapt sampling to remain within them. Our approach, CO3, is plug-and-play,
requires no model tuning, and complements standard classifier-free guidance. Experiments
on diverse multi-concept prompts indicate improvements in concept
coverage, balance, and robustness, with fewer dropped or distorted concepts
compared to standard baselines and prior compositional methods. Results suggest that
lightweight corrective guidance can substantially mitigate brittle semantic alignment behavior in modern diffusion systems.
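
The abstract does not spell out the update rule, but one way to read "steering away from regions that overlap too strongly with any single concept" on top of standard classifier-free guidance is sketched below. This is an illustrative assumption, not CO3's published procedure: the helper `corrective_cfg`, the denoiser signature `eps_model(x, t, emb)`, and the penalty weight `lam` are all hypothetical names introduced here for clarity.

```python
import torch

def corrective_cfg(eps_model, x_t, t, joint_emb, concept_embs, null_emb,
                   w=7.5, lam=0.5):
    """Illustrative noise prediction combining standard CFG on the joint
    prompt with a corrective term per single concept (a sketch, not CO3)."""
    # Standard classifier-free guidance toward the joint prompt.
    eps_null = eps_model(x_t, t, null_emb)
    eps_joint = eps_model(x_t, t, joint_emb)
    guided = eps_null + w * (eps_joint - eps_null)

    # Corrective term (assumed form): subtract a fraction of each
    # single-concept direction, steering away from modes where one
    # concept dominates the joint sample.
    for c_emb in concept_embs:
        eps_c = eps_model(x_t, t, c_emb)
        guided = guided - lam * (eps_c - eps_null)
    return guided

# Dummy denoiser, just to demonstrate the assumed call signature.
def dummy_eps(x, t, emb):
    return x * 0.1 + emb.mean() * 0.01

x = torch.randn(1, 4, 64, 64)
joint = torch.randn(77, 768)                      # e.g. "a cat and a clock"
cat, clock = torch.randn(77, 768), torch.randn(77, 768)
null = torch.zeros(77, 768)
eps = corrective_cfg(dummy_eps, x, t=50, joint_emb=joint,
                     concept_embs=[cat, clock], null_emb=null)
```

Under this reading, `lam` would play the role of the guidance-weight regime the abstract describes characterizing: too large and the joint signal is suppressed, too small and a single concept can still dominate.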
Primary Area: generative models
Submission Number: 16147