Keywords: Text-to-Image Generation, Semantic Alignment, Diffusion Models
Abstract: Text-to-image (T2I) diffusion models have advanced considerably, yet they still often fail to satisfy prompt conditions, producing attribute mismatches and missing objects.
These errors typically fall into two categories: $\textit{concept loss}$, where an object or attribute (i.e., a concept) goes missing, and $\textit{concept confusion}$, where an attribute is assigned to the wrong object or multiple objects blend together.
Recent studies suggest that these failures stem from the limited capacity of the CLIP text encoder to capture fine-grained semantic details.
Although several methods have been proposed, concept loss and concept confusion persist in recent T2I models when handling multi-object, multi-attribute prompts.
In this paper, we conduct an embedding-level analysis to develop an inference-time, model-independent solution that addresses both concept loss and concept confusion.
We find that (1) concepts mentioned later exhibit higher embedding entropy, indicating higher uncertainty and making them more vulnerable to concept loss, and (2) the first CLIP attention layer captures the strength of binding between each object and its attribute.
Guided by our findings, we introduce TIE, a method that improves semantic alignment through a single text-embedding update. TIE addresses concept loss via entropy-aware singular value amplification and resolves concept confusion through interpolation–extrapolation binding based on CLIP attention scores, all in a training-free manner.
Extensive experiments demonstrate that TIE enhances semantic fidelity in multi-concept scenarios with minimal sampling overhead.
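To make the entropy-aware singular value amplification idea concrete, the following is a minimal NumPy sketch, not the paper's actual implementation: it treats the prompt embedding as a tokens-by-dimensions matrix, measures mean token entropy as an uncertainty proxy, and scales the matrix's singular values accordingly. The gain `alpha`, the softmax-based entropy estimate, and the uniform scaling rule are all illustrative assumptions.

```python
import numpy as np

def token_entropy(e):
    # Shannon entropy of the softmax over one token's embedding dimensions
    # (an assumed proxy for the per-token uncertainty described in the abstract)
    p = np.exp(e - e.max())
    p /= p.sum()
    return float(-(p * np.log(p + 1e-12)).sum())

def entropy_aware_amplify(E, alpha=0.3):
    """Hypothetical sketch: boost the singular values of the prompt-embedding
    matrix E (tokens x dims) in proportion to mean token entropy, so that
    high-uncertainty (late-prompt) concepts are strengthened.
    `alpha` is an illustrative gain, not a value from the paper."""
    U, S, Vt = np.linalg.svd(E, full_matrices=False)
    h = np.mean([token_entropy(e) for e in E])          # average uncertainty
    scale = 1.0 + alpha * h / np.log(E.shape[1])        # normalize by max entropy
    return U @ np.diag(S * scale) @ Vt

rng = np.random.default_rng(0)
E = rng.normal(size=(8, 16))        # 8 tokens, 16-dim toy embeddings
E_amp = entropy_aware_amplify(E)    # same shape, amplified spectrum
```

Because the scaling factor exceeds 1, the update uniformly strengthens the embedding's spectrum while preserving its singular directions, which is one plausible reading of a "single text-embedding update".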
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 20