Keywords: Image Editing, Tuning-free, Non-rigid Editing
TL;DR: FSI-Edit is a tuning-free framework that injects stochastic noise and fuses high-frequency residuals to bridge semantic gaps and preserve the background, enabling high-quality rigid and non-rigid image editing.
Abstract: Latent diffusion-based text-to-image (T2I) models support tuning-free image editing: a typical pipeline inverts an image into noise, reconstructs it using its original text prompt, and then generates an edited version under a new target prompt. To preserve unaltered image content, features from the reconstruction are directly injected to replace selected features in the generation.
However, this direct replacement often leads to feature incompatibility, compromising editing fidelity and limiting creative flexibility, particularly for non-rigid edits (\emph{e.g.}, structural or pose changes).
In this paper, we aim to address these limitations by proposing \textbf{FSI-Edit}, a novel framework using frequency- and stochasticity-based feature injection for flexible image editing.
First, FSI-Edit enhances feature consistency by injecting \emph{high-frequency} components of reconstruction features into generation features, mitigating incompatibility while preserving the editing ability for major structures encoded in low-frequency information.
Second, it introduces controlled \emph{noise} into the replaced reconstruction features, expanding the generative space to enable diverse non-rigid edits beyond the original image’s constraints.
Experiments on non-rigid edits, \emph{e.g.}, addition, deletion, and pose manipulation, demonstrate that FSI-Edit outperforms existing baselines in target alignment, semantic fidelity, and visual quality. Our work highlights the critical roles of frequency-aware design and stochasticity in overcoming rigidity in diffusion-based editing.
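The two ideas in the abstract, high-frequency injection and controlled noise, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the FFT-based low-pass split, the function names `split_frequencies` and `fuse_features`, the `cutoff` and `noise_scale` parameters, and the choice to add Gaussian noise to the reconstruction features before extracting their high-frequency part are all assumptions made for illustration. The abstract does not specify which layers are affected, how frequencies are separated, or how the noise is scheduled.

```python
import numpy as np


def split_frequencies(feat, cutoff=0.1):
    """Split a (C, H, W) feature map into low- and high-frequency parts.

    Assumption: an ideal circular low-pass mask in the 2D FFT domain,
    with `cutoff` given in cycles per sample (0 to 0.5).
    """
    h, w = feat.shape[-2:]
    fy = np.fft.fftfreq(h)[:, None]  # vertical frequencies
    fx = np.fft.fftfreq(w)[None, :]  # horizontal frequencies
    low_mask = np.sqrt(fy**2 + fx**2) <= cutoff
    spectrum = np.fft.fft2(feat, axes=(-2, -1))
    low = np.fft.ifft2(spectrum * low_mask, axes=(-2, -1)).real
    high = feat - low  # residual holds everything above the cutoff
    return low, high


def fuse_features(recon_feat, gen_feat, cutoff=0.1, noise_scale=0.0, rng=None):
    """Keep the generation's low frequencies and inject the reconstruction's
    high frequencies, optionally perturbing the reconstruction with noise.

    Assumption: noise is zero-mean Gaussian with standard deviation
    `noise_scale`, added before the frequency split.
    """
    rng = np.random.default_rng() if rng is None else rng
    noisy_recon = recon_feat + noise_scale * rng.standard_normal(recon_feat.shape)
    _, recon_high = split_frequencies(noisy_recon, cutoff)
    gen_low, _ = split_frequencies(gen_feat, cutoff)
    return gen_low + recon_high


# Hypothetical feature maps standing in for reconstruction/generation features.
rng = np.random.default_rng(0)
recon = rng.standard_normal((4, 16, 16))
gen = rng.standard_normal((4, 16, 16))
fused = fuse_features(recon, gen, cutoff=0.1, noise_scale=0.05, rng=rng)
```

In this sketch the fused features inherit their coarse layout from the generation branch, so a structural change requested by the target prompt is not overwritten, while fine detail comes from the reconstruction branch. Setting `noise_scale` above zero loosens how tightly that detail is tied to the source image.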
Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)
Submission Number: 19660