Keywords: diffusion models, feature injection, editing, style transfer
TL;DR: We probe diffusion backbones (SD1.4/2.0, SDXL, Kandinsky, DiTs) via feature injection, revealing architecture-specific patterns and introducing FeatureInject, a training-free framework for text-guided editing and style transfer.
Abstract: Recent advances in diffusion models have enabled powerful text-to-image synthesis and training-free editing. However, despite growing architectural diversity, most editing techniques rely on implicit assumptions about shared internal representations across models. In this paper, we conduct a systematic, layer-wise analysis of internal representations across a wide range of diffusion architectures, including Stable Diffusion (SD1.4, SD2.0, SDXL), Kandinsky, and DiT-based models (SD3.5, Flux). Using a targeted feature injection protocol, we quantify how semantic and stylistic information propagates through U-Net backbones and their transformer-based counterparts. Our findings uncover architecture-specific encoding patterns: symmetric representational flow in SD1.4/2.0, bottleneck centrality in SDXL, decoder-centric representation in Kandinsky, and mid-to-late semantic representation formation in DiTs. We further show that adversarially distilled models preserve, and even amplify, their teacher's representational structure. These insights inform FeatureInject, a principled, training-free injection framework for text-guided image editing and style transfer. To the best of our knowledge, this is the first work to achieve successful training-free editing across such a broad range of diffusion architectures.
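The abstract refers to a targeted feature injection protocol; the sketch below is an illustrative, hedged example of what such a probe could look like using PyTorch forward hooks on a diffusers U-Net. The checkpoint, probe layer, injection strength, and timestep-aligned replacement are assumptions for illustration, not the paper's released implementation.

```python
# Illustrative sketch (assumed, not the paper's code): capture an intermediate
# feature map from a "source" denoising pass and re-inject it at the same layer
# during a "target" pass, to test what semantic/stylistic content that layer carries.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")
layer = pipe.unet.up_blocks[1]  # hypothetical probe location; varies per architecture

captured = []          # one feature map per denoising step of the source pass
step = {"i": 0}        # index into captured features during the target pass

def capture_hook(module, inputs, output):
    # Record the source pass's output at this layer for every timestep.
    captured.append(output.detach())

def inject_hook(module, inputs, output):
    # Replace the target pass's features with the timestep-aligned source features.
    alpha = 1.0  # injection strength; an assumed knob, not a paper-specified value
    feat = captured[min(step["i"], len(captured) - 1)]
    step["i"] += 1
    return alpha * feat + (1 - alpha) * output

# 1) Source pass: run the original prompt and record features at the probe layer.
handle = layer.register_forward_hook(capture_hook)
_ = pipe("a photo of a cat", num_inference_steps=30)
handle.remove()

# 2) Target pass: denoise the edit prompt while injecting the recorded features.
handle = layer.register_forward_hook(inject_hook)
edited = pipe("a watercolor painting of a cat", num_inference_steps=30).images[0]
handle.remove()
```

Sweeping the probe layer (down blocks, bottleneck, up blocks, or DiT blocks) and measuring how much of the source's semantics or style survives in the edited output is one way such a layer-wise analysis could be operationalized.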
Primary Area: generative models
Submission Number: 17432