Directional Textual Inversion for Personalized Text-to-Image Generation

ICLR 2026 Conference Submission8 Authors

01 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: personalized generation, text-to-image models, textual inversion
TL;DR: We propose Directional Textual Inversion that improves text fidelity for personalized text-to-image generation.
Abstract: Textual Inversion (TI) is an efficient approach to text‑to‑image personalization but often fails on complex prompts. We trace these failures to embedding norm inflation: learned tokens drift to out‑of‑distribution magnitudes, degrading prompt conditioning in pre‑norm Transformers. Empirically, we show that semantics are primarily encoded by direction in CLIP token space, while inflated norms harm contextualization; theoretically, we analyze how large magnitudes attenuate positional information and hinder residual updates in pre‑norm blocks. We propose Directional Textual Inversion (DTI), which fixes the embedding magnitude to an in‑distribution scale and optimizes only the direction on the unit hypersphere via Riemannian SGD. We cast direction learning as maximum a posteriori (MAP) estimation with a von Mises–Fisher prior, yielding a constant‑direction prior gradient that is simple and efficient to incorporate. Across personalization tasks, DTI improves text fidelity over TI and TI variants while maintaining subject similarity. Crucially, DTI’s hyperspherical parameterization enables smooth, semantically coherent interpolation between learned concepts via spherical linear interpolation (slerp), a capability absent in standard TI. Our findings suggest that direction‑only optimization is a robust and scalable path for prompt‑faithful personalization.
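To make the abstract's optimization concrete, below is a minimal sketch, not the authors' code, of what direction-only learning with a fixed embedding norm could look like: the Euclidean gradient is combined with the constant-direction von Mises–Fisher prior gradient, projected onto the tangent space of the unit hypersphere (Riemannian SGD), and the update is retracted back to the sphere; a slerp helper illustrates the interpolation capability. Names such as `dti_step`, `fixed_norm`, `prior_mu`, and `kappa` are illustrative assumptions, not the paper's API.

```python
import torch

def dti_step(v, grad, prior_mu, fixed_norm=1.0, kappa=0.1, lr=1e-3):
    """One Riemannian SGD step on the unit hypersphere for a token direction.

    v         : current unit direction of the learned token embedding, shape (d,)
    grad      : Euclidean gradient of the diffusion loss w.r.t. the embedding, shape (d,)
    prior_mu  : unit mean direction of the assumed von Mises-Fisher prior, shape (d,)
    fixed_norm: in-distribution magnitude used when the token is injected into the prompt
    kappa     : vMF concentration; its prior gradient is the constant direction kappa * prior_mu
    """
    # MAP objective: descend the loss gradient while ascending the vMF log-density kappa * mu^T v.
    euclidean_grad = grad - kappa * prior_mu
    # Project onto the tangent space at v (remove the radial component).
    riemannian_grad = euclidean_grad - (euclidean_grad @ v) * v
    # Gradient step followed by retraction (renormalization) back onto the sphere.
    v_new = v - lr * riemannian_grad
    v_new = v_new / v_new.norm()
    # Return the direction and the fixed-magnitude embedding actually used for conditioning.
    return v_new, fixed_norm * v_new


def slerp(u, w, t):
    """Spherical linear interpolation between two learned unit directions."""
    cos = torch.clamp(u @ w, -1.0, 1.0)
    theta = torch.acos(cos)
    if theta < 1e-6:  # nearly identical directions: fall back to linear interpolation
        return (1 - t) * u + t * w
    return (torch.sin((1 - t) * theta) * u + torch.sin(t * theta) * w) / torch.sin(theta)
```

In this reading, standard TI would update the raw embedding directly (letting its norm inflate), whereas the sketch keeps the magnitude pinned to `fixed_norm` and only ever moves the direction.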
Supplementary Material: zip
Primary Area: generative models
Submission Number: 8