\section{Introduction}
\label{sec:intro}

Breakthroughs in diffusion models have unlocked unprecedented avenues for generating images and videos. Models such as Stable Diffusion~\cite{rombach_high-resolution_2022}, Midjourney, and DALL-E~\cite{ramesh_hierarchical_2022} have driven this evolution, and their outputs have transformed diverse creative domains. Their influence reaches digital hobbyist circles and established professional practices such as illustration, graphic design, and multimedia arts, fostering innovative artistic exploration and community collaboration.

Despite the enormous generative affordances of these methods, broader control over the output is necessary for wider adoption in creative communities, whose workflows often rely on trial-and-error iterative refinement, mood boarding, and inspiration.

\begin{figure}[h]
    \centering
    \includegraphics[width=0.4\textwidth]{sec/figures/Figures_for_disentanglement_paper_14.png}
    \caption{Examples of image editing results on Wild-TI2I and ImageNet-R-TI2I real and generated images.}
    \label{fig:edit_comparison}
\end{figure}

While previous generations of image generation models, including Variational Autoencoders~\cite{kingma_auto-encoding_2022} and Generative Adversarial Networks~\cite{karras_style-based_2019}, leverage the latent space for image editing~\cite{higgins_beta-vae_2016, karras_style-based_2019, shen_interfacegan_2022}, diffusion models~\cite{sohl-dickstein_deep_2015, ho_denoising_2020} are based on a Markov chain denoising process and inherently lack a single latent space. For U-Net-based diffusion models, training-free approaches to image editing focus on swapping different modules of the denoising architecture, including the self- and cross-attention modules and the \textit{h-space}, the bottleneck of the U-Net. However, the skip connection, an essential element of the U-Net that aids the transmission of long-range dependencies and gradient propagation, has not been explored. In contrast to existing work, we focus on the skip connections and their role in U-Net-based diffusion models.

%, introduced in 2015 by Sohl-Dickstein \etal~\cite{sohl-dickstein_deep_2015} and popularized after Ho \etal's 2020 Denoising Diffusion Probabilistic Models (DDPM)~\cite{ho_denoising_2020}, 
%These models generally leverage either a U-Net-based backbone~\cite{ronneberger_u-net_2015} or a Transformer-based one~\cite{peebles_scalable_2023, vaswani_attention_2017}. In the first versions of latent diffusion models, namely Stable Diffusion 1.4, 1.5, 2, 2.1, XL, and Turbo alternatives, denoising is achieved by training a U-Net. While this architecture features a bottleneck layer, referred to by the literature as the \textit{h-space}~\cite{kwon_diffusion_2023}, its nature is different from previous models due to the presence of residual skip connections and time-dependence. 

%A line of research attempts to treat the \textit{h-space} of the model or the noise space, \textit{$\epsilon$-space}, similarly to StyleGAN \cite{karras_style-based_2019, kwon_diffusion_2023, zhu_boundary_2023, haas_discovering_2023}, searching for a linear disentangled direction to manipulate the output image. They show that injecting an image's \textit{h-space} into another image changes the high-level semantics while retaining the structure and background. The same phenomenon is not observable in TTI latent diffusion models, where the effect of the \textit{h-space} on the image is often negligible, as shown in \cref{sec:skip}. 

%In the context of TTI latent diffusion models, several works focus on training-free editing by injection of relevant elements of the architecture of the diffusion process~\cite{hertz_prompt--prompt_2022, liu_towards_2024, tumanyan_plug-and-play_2023}. They show that injection of the middle decoding blocks, the cross-attentions, and self-attentions control the content and achieve successful text-driven editing.

%In addition to the \textit{h-space}, U-Net architectures incorporate another component worth considering in the context of TTI models: the skip connections. What role do these skip connections play in relation to the function of the \textit{h-space}? %While studies suggest that the \textit{h-space} can be considered the latent space in line with traditional GAN and VAE spaces, which are semantically organized and low-dimensional, the latent space of the bottleneck layer of the U-Net is much higher dimensional (around 150x the size of the h-space), time-step related, and with spatial dimensions (typically the \textit{h-space} is $1280\times 8 \times 8$ dimensional). While disentangled directions found in the latent spaces of GANs and VAEs have been found to exhibit a semantically aligned cosine similarity, the similarity of disentangled directions found in the \textit{h-space} does not mirror humanly perceived similarity~\cite{schaerf_colorwai_2024}. Therefore, we propose interpreting the U-Net as having multiple latent spaces, corresponding to the \textit{h-space} and the set of \textit{skip connections} at each time step.

To better understand the role of this module, we address the following questions: (i) How and where is information represented in the skip connections of the U-Net? (ii) How does it influence image generation? (iii) When does this information arise during the denoising process?
% (iv) Do the latent spaces lie in a lower dimensional manifold

\begin{figure}[h]
    \centering
    \includegraphics[width=0.4\textwidth]{sec/figures/36.png}
    \caption{Image editing results on generated faces. We show precise transformations ranging from subtle changes, such as makeup and hairstyle adjustments, to more global changes, such as zombie-like effects. Our method preserves the core identity of each subject, maintaining facial structure.}
    \label{fig:celeba_comparison}
\end{figure}
\begin{figure*}[h]
    \centering
    \includegraphics[width=0.7\textwidth]{sec/figures/35.png}
    \caption{Examples of style transfer results on the Artist dataset \cite{jiang_artist_2024}.}
    \label{fig:style_comparison}
\end{figure*}

Interestingly, we observe that Stable Diffusion internally disentangles content from style within the third encoder/decoder block, with the content passing through the skip connection and the style through the main flow.

We find that injecting the third group of skip connections produced by the encoder for image \textit{A} into the generation of image \textit{B} transfers the spatial configuration of image \textit{A} onto image \textit{B}. Viewed the other way, the same injection transfers the style of image \textit{B} onto image \textit{A}, indicating that the corresponding third decoder block carries the style information. Additionally, the choice of injection timestep controls how strongly the background of image \textit{B} appears over image \textit{A}, and modulating the mixing on the embedding controls the strength of the injection.
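
As an illustrative sketch of this injection, using notation introduced here only for exposition (the blend weight $\alpha$, the timestep set $\mathcal{T}$, and the linear form of the blend are assumptions, not the exact formulation), let $\mathbf{s}^{A}_{3,t}$ and $\mathbf{s}^{B}_{3,t}$ denote the third group of skip-connection features produced by the encoder at timestep $t$ while denoising images \textit{A} and \textit{B}, respectively. The features passed to the corresponding decoder block when generating image \textit{B} could then be written as
\begin{equation}
\tilde{\mathbf{s}}^{B}_{3,t} =
\begin{cases}
\alpha\, \mathbf{s}^{A}_{3,t} + (1-\alpha)\, \mathbf{s}^{B}_{3,t}, & t \in \mathcal{T}, \\
\mathbf{s}^{B}_{3,t}, & \text{otherwise},
\end{cases}
\end{equation}
where $\alpha \in [0,1]$ would set the strength of the injection ($\alpha = 1$ is a full replacement and $\alpha = 0$ leaves the generation of \textit{B} unchanged) and $\mathcal{T}$ is the set of timesteps at which the injection is applied. In particular, strength control is described above as a mixing on the embedding; the feature-level blend in this sketch is a simplified stand-in for it.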

We demonstrate that an informed use of the properties of Stable Diffusion can achieve state-of-the-art performance on a wide variety of tasks, offering ample control over the intensity and nature of the output. In \cref{sec:experiments}, we highlight the superiority of our method in achieving text-based image editing and style transfer and show preliminary results on fine-grained feature editing in \cref{fig:celeba_comparison}.
%Furthermore, when image A and image B represent the same or similar subject, the skip connections of image A can be used to edit image B, modifying specific features with the aid of the prompt. 

To summarize, our contributions are as follows:
\begin{itemize}
    \item We investigate the role of the skip connections in the U-Net of Stable Diffusion, assessing their properties, their influence on the image, and variation across time steps.
    \item We propose an efficient and controllable image editing method and demonstrate performance superior to or on par with the state of the art in transferring content and style.
    \item Lastly, we propose three alternatives to modulate the editing effect.
\end{itemize}

\section{Related work}
In this section, we briefly explain the importance of latent space studies in the contexts of media studies and digital arts to further motivate the focus of this paper. We then cover image editing methods for Stable Diffusion.

\subsection{Latent space in the arts and humanities}
The latent space, understood strictly as the space where the data lies in the bottleneck layer of a model, is a topical entity for studying and understanding models beyond technical fields. These spaces are studied as n-dimensional cultural objects~\cite{rodriguez-ortega_techno-concepts_2022}. Latent spaces render the cultural knowledge fed into or generated by the model continuous and spatialized, creating an implicit, meaningful organization~\cite{seaver_everything_2021}. These representations can then be studied as a map of culture~\cite{underwood_mapping_2021} and can, in turn, be used to study models as cultural snapshots of reality~\cite{underwood_can_2021, cetinic_myth_2022, impett_there_2023, weisbuch_cultural_2017}.

Digital artists and creative industries have extensively used latent space-rooted methodologies, such as latent space walks and interpolation, to take advantage of the semantic continuity of this space. Starting with DeepDream~\cite{noauthor_inceptionism_nodate}, the continuity of the latent space, as opposed to the fragmentation of reality, has created an attractive space of artistic hallucination~\cite{noauthor_art_nodate}.

\subsection{Image manipulation}
We present some of the pivotal works on image manipulation with diffusion models, organized by the element used for editing.

\noindent \textbf{Latent code-based editing.} Asyrp~\cite{kwon_diffusion_2023} uses the \textit{h-space} and CLIP supervision to find an editing direction in this space at each timestep, which is added to the original latent during denoising through a modified Denoising Diffusion Implicit Model (DDIM)~\cite{song_denoising_2022}. Boundary diffusion~\cite{zhu_boundary_2023}, on the other hand, computes a modification direction that is injected only at the mixing step, testing both the \textit{$\epsilon$-space} and the \textit{h-space}. Haas \etal~\cite{haas_discovering_2023}, among other findings, show that injecting the \textit{h-space} of an image into another image changes the high-level semantics while retaining the structure and background. InjectFusion~\cite{jeong_training-free_2024} observes the same phenomenon and implements a calibrated procedure to inject the new \textit{h-space} while maintaining the same correlation with the skip connections. These methods are mostly based on unconditional DDPM-based models trained on specific datasets, \eg, CelebA.

\textbf{Module-based editing.} Prompt-to-Prompt (P2P)~\cite{hertz_prompt--prompt_2022} substitutes the cross-attentions of the U-Net layers to obtain text-based image editing. Plug and Play (PnP)~\cite{tumanyan_plug-and-play_2023} finds that accurate editing can be achieved by injecting the spatial features of the middle decoding and self-attention layers. Closely related to the two previous works, Liu \etal~\cite{liu_towards_2024} investigate the role of cross-attention and self-attention in the different feature layers, again observing that intermediate features are the most salient. Finally, Artist~\cite{jiang_artist_2024} shows that using the middle residual blocks, as in PnP, to control the content and the cross-attentions to inform the style yields successful text-driven stylization.

\textbf{Text-based editing.} A common alternative for diffusion models is to manipulate the text conditioning. Methods like DiffusionCLIP~\cite{kim_diffusionclip_2022} and InstructPix2Pix~\cite{brooks_instructpix2pix_2023} fine-tune the model or the text conditioning to obtain the desired edits. Various successful methods, such as DreamBooth~\cite{ruiz_dreambooth_2023}, personalize the outputs to specific entities. Lastly, methods like LDEdit~\cite{chandramouli_ldedit_nodate} leverage partial inversion and text-guided generation to achieve fast, training-free editing.

\textbf{Adapters.} Other popular methods leverage adapters, including ControlNet~\cite{zhang_adding_2023} and T2I-Adapter~\cite{mou_t2i-adapter_2023}, to increase the modalities that can be used to control the diffusion process. They train an ad hoc adapter for each additional modality, obtaining perceptually interesting outcomes. To increase manipulability, other methods make use of specifically trained LoRA adapters, such as PreciseControl~\cite{parihar_precisecontrol_2024}, CTRLorALTer~\cite{stracke_ctrloralter_2024}, and Concept Sliders~\cite{gandikota_concept_2023}, which can achieve controlled modifications for the semantics they were trained on.

\subsection{Novelty}
Compared to existing methods, our approach is simpler while offering finer control. It provides several plug-in mechanisms to modulate the effect and supports editing both the content and the style of an image within the same pipeline. Lastly, we show that our method also performs well on Turbo variants, which yield the fastest results.
