TextIT: Inference-Time Representation Alignment for Improved Visual Text Generation in Diffusion Models
Supplementary Material: pdf
Track: Proceedings Track
Keywords: representation alignment, diffusion models, visual text generation, inference-time optimisation
TL;DR: We propose TextIT, a training-free inference-time representation alignment method to improve visual text generation in diffusion models.
Abstract: Recent advances in text-to-image diffusion models have shown remarkable performance in generating realistic images from text descriptions. However, rendering high-quality visual text within generated images remains a major challenge. Gibberish text is especially common when the model must render proper nouns or text rarely seen in the training data. Unlike existing methods for improving visual text generation, which rely on data-intensive and time-consuming fine-tuning, we propose TextIT, an inference-time representation alignment algorithm that requires no additional data or training.
First, we propose an inference-time self-attention manipulation loss that exposes the latent intermediate self-attention (SA) representations governing visual text generation and aligns them with those of correctly rendered text. Next, we impose fine-grained control over the generated text by aligning character-wise control points, obtained through self-attention map vectorization, with ground-truth character control points. We provide evidence that inference-time representational manipulation enables controllable and interpretable improvements in text-to-image generation, validating our method with character- and word-level visual text generation results that retain the overall generative diversity of diffusion models.
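To make the two losses in the abstract concrete, below is a minimal PyTorch sketch of how an inference-time alignment step of this kind could be wired up. Everything in it is our own assumption for illustration: the helper callables `capture_sa` and `vectorize`, the reference dictionary `refs`, the MSE formulation, and the SGD update on the latent are stand-ins, not the authors' implementation.

```python
# Minimal sketch (our assumptions, not the TextIT implementation) of an
# inference-time alignment step combining the two losses from the abstract.
import torch
import torch.nn.functional as F

def alignment_loss(sa_maps, ref_sa_maps, ctrl_pts, ref_ctrl_pts, lam=0.5):
    """Combine SA-map alignment with character control-point alignment."""
    # Self-attention manipulation term: pull the current SA representations
    # toward those observed for correctly rendered text (hypothetical refs).
    sa_term = F.mse_loss(sa_maps, ref_sa_maps)
    # Control-point term: distance between character control points
    # vectorized from the SA maps and ground-truth character control points.
    cp_term = F.mse_loss(ctrl_pts, ref_ctrl_pts)
    return sa_term + lam * cp_term

def guided_step(latent, capture_sa, vectorize, refs, lr=0.1, n_iters=3):
    """One training-free correction: gradient-descend the latent on the loss.

    capture_sa: callable returning SA maps for a latent (e.g. via UNet hooks).
    vectorize:  callable turning SA maps into character control points.
    refs:       dict with reference "sa" maps and "ctrl_pts" tensors.
    """
    latent = latent.detach().requires_grad_(True)
    opt = torch.optim.SGD([latent], lr=lr)
    for _ in range(n_iters):
        opt.zero_grad()
        sa = capture_sa(latent)
        pts = vectorize(sa)
        loss = alignment_loss(sa, refs["sa"], pts, refs["ctrl_pts"])
        loss.backward()
        opt.step()
    return latent.detach()
```

In a full pipeline, a step like this would presumably run at selected denoising timesteps, after which ordinary sampling resumes, so no weights are updated and no extra training data is needed.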
Submission Number: 90