\section{Preliminaries}

In this section, we review Latent Diffusion Models (LDMs), introduced by Rombach\etal~\cite{rombach_high-resolution_2022}, with a particular emphasis on their U-Net backbone~\cite{ronneberger_u-net_2015}.

\subsection{Latent Diffusion} Latent Diffusion Models (LDMs) overcome the high computational and memory costs of pixel-based diffusion models by running the diffusion process in a reduced latent space. To achieve this, a pre-trained autoencoder converts an image into a compact latent representation $z_0$ whose spatial resolution is $1/8$ of the original along each side. The diffusion process is then applied to $z_0$, which substantially lowers the resource demands during training and sampling. During training, a neural network, the U-Net, is optimized to predict the noise added to the latent.
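
For concreteness, a $512\times512$ image is mapped to a latent with $64\times64$ spatial resolution. The noise-prediction objective can be written, following the standard formulation of~\cite{rombach_high-resolution_2022}, as
\begin{equation}
\mathcal{L}_{\text{LDM}} = \mathbb{E}_{z_0,\, c,\, \epsilon \sim \mathcal{N}(0, I),\, t}\left[ \left\lVert \epsilon - \epsilon_\theta(z_t, t, c) \right\rVert_2^2 \right],
\end{equation}
\noindent where, in notation assumed here for exposition, $\epsilon_\theta$ denotes the U-Net, $z_t$ the noised latent at timestep $t$, and $c$ the text conditioning.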

\subsection{Components of the U-Net} In our work, we leverage a pre-trained text-conditioned LDM with a U-Net backbone, popularly known as Stable Diffusion (versions 1.4, 1.5, 2, 2.1 and SDXL). In Stable Diffusion, the conventional U-Net architecture~\cite{ronneberger_u-net_2015} is enhanced with attention mechanisms, namely self- and cross-attention blocks.

The \textbf{residual block} takes as input the latent features \( \phi_{t}^{l-1} \) from the previous layer \( l-1 \) and outputs both the latent features \( \phi_{t}^{l} \), which are passed to the following block, and the skip connection \( f_t^l \), which is concatenated directly to the corresponding decoding layer:
\begin{equation} 
f_t^l, \phi_t^l = \text{ResBlock}(\phi_{t}^{l-1}),
\end{equation}
\noindent where ResBlock includes convolutional layers.

The Stable Diffusion 1.x and 2.x models feature four encoding blocks, a bottleneck, and four mirroring decoding blocks. Each block contains three subblocks, each passing one skip connection. In the remainder of the paper, we refer to the skip connections by number, where $l=0$ is the first skip connection and $l=12$ is the one preceding the bottleneck.

\section{Analysis}
\label{sec:analyses}
As explained in the previous section, skip connections are a critical component of the U-Net backbone, allowing long-range information flow and mitigating the vanishing gradient problem. However, their role within the Stable Diffusion models remains largely unknown. In this section, we investigate the role of each skip connection, the effect of the injection timesteps, and the properties of the skip-connection embeddings, to shed light on their behavior.

\subsection{The role of skip connections}
\label{sec:skip}
\begin{figure}[h]
\centering
\includegraphics[width=0.5\textwidth]{sec/figures/Figures_for_disentanglement_paper_24.png}
\caption{Visualization of the effect of switching each group of skip connections. Each result is obtained by swapping the skip connections of the indicated group. We observe that the \textit{h-space} has an almost imperceptible effect on the final image, contrary to what research into the disentanglement of DDPMs reports. The first group of skip connections, closest to the \textit{h-space}, similarly has a limited effect, whereas the most coherent blending occurs in the second group of skip connections. The third group has no coherent effect on the image and generates random distortions, while the fourth behaves similarly to raw pixel blending.}
\label{fig:skips}
\end{figure}
To analyze the effect of each skip connection, we store the skip connections of an injection image \( A \) and inject them, individually and in combination, into the generation of the original image \( B \). Both generations start from a common initial noise $z_t$, which we select with one of two strategies: (i) we randomly sample $z_t$ from a Gaussian distribution, or (ii) we use the result of the DDIM inversion of either \textit{A}, i.e., $z_t^A$, or \textit{B}, i.e., $z_t^B$. We first fully denoise $z_t$ with the prompt of image $A$, $p^A$, and store the skip connection $f_t^l$ at each timestep $t$. Subsequently, we denoise $z_t$ using the prompt of image $B$, $p^B$. At each timestep from $t_{start}$ to $t_{end}$, we replace the skip connection computed for image $B$ with the stored $f_t^l$ of image $A$. We show an example of the effect we obtain by substituting each group of three skip connections (group 1: $l=1,2,3$; group 2: $l=4,5,6$; group 3: $l=7,8,9$; group 4: $l=10,11,12$) and the \textit{h-space} in \cref{fig:skips}.
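
The substitution can be summarized compactly. In notation assumed here for exposition, let $f_t^{l,A}$ and $f_t^{l,B}$ denote the skip connections obtained with $p^A$ and $p^B$, respectively, and let $S$ be the set of swapped layers (e.g., $S=\{4,5,6\}$ for group 2). Since $t$ decreases during denoising, the skip connection passed to the decoder is
\begin{equation}
\hat{f}_t^{l} =
\begin{cases}
f_t^{l,A} & \text{if } l \in S \text{ and } t_{end} \leq t \leq t_{start},\\
f_t^{l,B} & \text{otherwise}.
\end{cases}
\end{equation}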

Previous studies~\cite{tumanyan_plug-and-play_2023, jiang_artist_2024, liu_towards_2024} indicate that the middle decoding layers, or the middle cross- and self-attention blocks, are the most determinant of the content, suggesting that the structural information is formed roughly halfway through the decoding blocks. Our results align with these findings, since the third group of skip connections lies roughly halfway through the depth of the model. However, they also suggest that this information is already encoded in the encoder and passed on to the decoder via the residual block.

Accepting the standard distinctions of foreground-background and content-style\footnote{While these terms do not have a precise definition, by content, we generally mean the structure of the object, and by style, the colors, textures, and patterns.}, we observe that injecting the second group of skip connections of image \textit{A} into image \textit{B} preserves the style of image \textit{B} while transferring the content of image \textit{A}. In this example, the output keeps the background style of \textit{B} (the color scheme) and the foreground style of \textit{B} (the stripes of the zebra), and adopts the background content of \textit{A} (the savannah) and the foreground content of \textit{A} (the silhouette of the elephant).

\begin{figure}[h]
\centering
\includegraphics[width=0.4\textwidth]{sec/figures/Figures_for_disentanglement_paper_28.png}
\caption{Close-up of the second group of skip connections. The image shows the effect of injecting different combinations of the connections in this group. From bottom to top, we inject a single skip connection, pairs of skip connections, and finally all three. Specifically, we observe that the combination of $l=4$ and $l=5$ carries the most information: $l=5$ injected alone creates a minimal change in the image but, when combined with $l=4$, determines the spatial structure of the output. $l=4$ alone conveys the structure of the foreground only.}
\label{fig:skips_2}
\end{figure}


%This observation is significant because the middle layers occur after the main cross and self-attention processes, representing a point where spatial information is fully structured and ready to be rearranged. Additional high-level features are then progressively incorporated, exerting more influence on the image at the decoding level.

%\subsection{The different versions of Stable Diffusion}

%The difference in behavior to previous findings on the \textit{h-space} seems to indicate an evolution in the representations learned by DDPM, later Stable Diffusion 1.4-1.5 and Stable Diffusion 2 taken into account in this paper. The major difference between Stable Diffusion 1.* and Stable Diffusion 2.* lies in the text embedding used for conditioning: the first versions use OpenAI's CLIP~\cite{radford_learning_2021}, while the second version uses OpenCLIP, trained on the publicly available LAION dataset~\cite{schuhmann_laion-5b_2022}. Because of the open-source nature of the model, the model is less aware of celebrities and styles of contemporary artists.  

%\begin{figure}[h]
%\centering
%\includegraphics[width=0.5\textwidth]{sec/figures/3.png}
%\caption{Visualization of the effect on v1.5. As visible in the figure, the effect of the second group of skip connections is now split between the \textit{h-space}, carrying the content of the foreground, and the second group of skips, determining the content of the background, without structure.}
%\label{fig:versions}
%\end{figure}

%To understand whether the models form different representations, we carry out the same experiment on different versions of Stable Diffusion. We find that Stable Diffusion 2 and 2.1 behave similarly, as well as the Turbo version of 2.1, while Stable Diffusion 1.4 and 1.5 show a larger impact of the \textit{h-space}, as in \cref{fig:versions}. 
 
\subsection{The effect of the timesteps}

In this section, we investigate the role of timesteps in the diffusion process (see \cref{fig:timesteps}) by restricting the injection of the skip connection of image \textit{A} into image \textit{B} to a window with $t_{start} \neq 1000$ and $t_{end} \neq 0$. We observe that the first 150 steps ($t_{start} = 850$) have little impact on the final image, while the last 150 steps ($t_{end} = 150$) only serve as refinement, as also found in Asyrp~\cite{kwon_diffusion_2023}. We find that the skip connection used during the first $500$ denoising steps, whether that of image \textit{A} or of image \textit{B}, determines the content of the foreground, while the one used during the last $500$ steps determines the background.

\begin{figure}[h]
\centering
\includegraphics[width=0.5\textwidth]{sec/figures/Figures_for_disentanglement_paper_17.png}
\caption{Visualization of the effect of the injection timesteps. We observe that starting the injection later, at $t_{start} < 850$, leads to distortions in the foreground content, while ending the injection earlier, at $t_{end} > 150$, reveals the background content of the original image. This phenomenon is consistent across all the images we generated.}
\label{fig:timesteps}
\end{figure}

\subsection{Modulating the effect}
To achieve more controllable results, we investigate methods to modulate the intensity of the change. 

\textbf{Injection classifier-free guidance.} Inspired by classifier-free guidance~\cite{ho_classifier-free_2021}, we test the use of a linear combination of the injected embedding
and original embedding of the changed skip connections, parametrized by $\gamma$ to balance the intensity of the mix. At each denoising step, the embedding that is actually injected, $\tilde{f}(t,l)$, becomes:
\begin{equation}
\tilde{f}(t,l) = f^B(t,l) + \gamma\,(f^A(t,l) - f^B(t,l)),
\end{equation}
where $f^A(t,l)$ and $f^B(t,l)$ are the skip connections of the injection image and of the original image, $t$ is the denoising timestep, and $l$ the skip connection layer.
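
Rearranging the right-hand side of the guidance equation makes the role of $\gamma$ explicit:
\begin{equation}
f^B(t,l) + \gamma\,(f^A(t,l) - f^B(t,l)) = (1-\gamma)\, f^B(t,l) + \gamma\, f^A(t,l).
\end{equation}
Hence, $\gamma = 0$ leaves the original embedding $f^B(t,l)$ unchanged, $\gamma = 1$ recovers the plain injection of $f^A(t,l)$, values in $(0,1)$ interpolate between the two, and $\gamma > 1$ extrapolates beyond $f^A(t,l)$ along the direction $f^A(t,l) - f^B(t,l)$.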

\textbf{Depth-wise alternation of the spatial embedding of the skip connections.} The skip connections are high-dimensional. For instance, the layer $l=4$ for a $512\times512$ output size is $1280\times16\times16$ dimensional. We hypothesize that the information stored in these embeddings is, therefore, highly redundant. %We plot these embeddings as $1280$ images of $16\times16$ pixels, and we find that over $90\%$ of the kernels show the same shape with varying average or inverse intensities. Therefore, we suspect redundancy in the depth channel. 

\begin{figure}[h]
\centering
\includegraphics[width=0.5\textwidth]{sec/figures/Figures_for_disentanglement_paper_18.png}
\caption{Visualization of modulation methods. We show the effects of the two modulation methods at $\gamma = r \in [0,1]$. We observe that both methods successfully modulate the intensity of the effect, and we empirically find that using both methods together obtains the best results. The advantage of the guidance is that it can amplify the effect with values above 1; however, unlike the second modulation method, it struggles for values around 1, where the image should be similar to the non-modulated effect.}
\label{fig:modulation}
\end{figure}

Based on this hypothesis, we test an additional modulation method: we randomly replace, at a ratio $r$, the channels of $f^B(t,l)$ with those of $f^A(t,l)$, where each channel is a $16\times16$ image. That is to say, for a randomly selected $1280\times r$ of the $16\times16$ channel images, we inject the channel of the injection image, and we keep the original channel in the other cases. We observe that the depth-wise alternation successfully achieves a modulation, as shown in \cref{fig:modulation}.
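
As an illustrative formalization (the binary mask $m$, the channel index $c$, and the number of channels $C$ are notation assumed here for exposition), the alternation can be written per channel as
\begin{equation}
\hat{f}_c(t,l) = m_c\, f^A_c(t,l) + (1 - m_c)\, f^B_c(t,l), \qquad m_c \in \{0,1\}, \qquad \sum_{c=1}^{C} m_c \approx r\,C,
\end{equation}
where the channels with $m_c = 1$ are chosen at random. For example, with $C = 1280$ and $r = 0.25$, $320$ channels are taken from the injection image and the remaining $960$ from the original one.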

%In sum, the injection timing can control whether the background is retained or replaced with that of the original image, and the injection strength can be further modulated using classifier-free guidance and depth-channel alternation. 


