Learning Diverse Textual Contexts for Robust Personalization of Text-to-Image Diffusion Models

13 Sept 2025 (modified: 28 Jan 2026) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Personalized Text-to-Image Generation, Text-to-Image Generation
TL;DR: We diversify the contexts of personal concepts within text space and learn those contexts within text space for robust T2I personalization.
Abstract: Text-to-image (T2I) personalization aims to adapt pre-trained T2I models to user-provided example images for customized image generation. In existing personalization approaches, the models are typically trained with a small number of personal concept images captured in limited contexts. This often weakens robustness, resulting in poor alignment between the text prompts and the generated images. Existing approaches tackle this by collecting additional images to diversify the contexts in which the personal concept appears. Despite its effectiveness, this is often impractical given the high cost of image collection. To circumvent this limitation, we instead diversify the contexts of the personal concept in \emph{text space}. Since T2I personalization methods represent personal concepts as text tokens (\textit{e.g.,} ``[v]''), this diversification can be achieved simply by composing the tokens with various contextual words (\textit{e.g.,} ``[v] at the Eiffel Tower''), offering an efficient alternative to costly manual image collection. During personalization, we use these text prompts as training inputs to learn the diversified contexts. However, utilizing diversified text prompts for personalization is not straightforward, as T2I personalization typically requires paired images as learning targets. To learn these contexts without requiring images, we propose to learn them within \emph{text space}. Specifically, we employ masked language modeling (MLM), which operates entirely within \emph{text space}; applied during personalization, MLM lets the model learn the diversified contexts without involving any images. Extensive experiments demonstrate that diverse context learning with MLM yields notable improvements in prompt fidelity and achieves state-of-the-art results on widely used public benchmarks. Furthermore, we present an analytical study showing, through cosine distance analysis of text embeddings, how our approach influences representations in text space, and, through cross-attention map analysis, how these effects propagate to image space, providing further evidence of its effectiveness.
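To make the two ideas in the abstract concrete (context diversification via prompt templates and MLM-style learning entirely in text space), the following is a minimal sketch, not the paper's actual implementation. It assumes a HuggingFace-style tokenizer and text encoder; `CONTEXT_TEMPLATES`, `mlm_head`, `mask_token_id`, and `p_mask` are illustrative assumptions (CLIP-style text encoders do not ship with an MLM head, so `mlm_head` stands for a hypothetical auxiliary prediction head).

```python
import random
import torch
import torch.nn.functional as F

# Hypothetical prompts that place the learnable concept token "[v]"
# in diverse textual contexts -- no paired images are needed.
CONTEXT_TEMPLATES = [
    "a photo of [v] at the Eiffel Tower",
    "[v] swimming in a pool",
    "an oil painting of [v] in a forest",
]

def mlm_text_loss(tokenizer, text_encoder, mlm_head, mask_token_id, p_mask=0.15):
    """One MLM step entirely in text space: sample a diversified prompt,
    mask random context tokens, and predict them back."""
    prompt = random.choice(CONTEXT_TEMPLATES)
    ids = tokenizer(prompt, return_tensors="pt").input_ids  # (1, seq_len)

    # Mask random context tokens; the concept token "[v]" is never masked.
    v_id = tokenizer.convert_tokens_to_ids("[v]")
    mask = (torch.rand(ids.shape) < p_mask) & (ids != v_id)
    labels = ids.clone()
    labels[~mask] = -100                 # only masked positions incur loss
    masked_ids = ids.clone()
    masked_ids[mask] = mask_token_id

    hidden = text_encoder(masked_ids).last_hidden_state   # contextual embeddings
    logits = mlm_head(hidden)                             # (1, seq_len, vocab_size)
    return F.cross_entropy(logits.view(-1, logits.size(-1)),
                           labels.view(-1), ignore_index=-100)
```

In such a setup, this text-space loss would be added to the usual image-conditioned personalization objective during training, so the gradients flowing into the "[v]" embedding also reflect the diversified contexts.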
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 4602