\begin{figure*}[h]
\centering
\includegraphics[width=\textwidth]{sec/figures/34.png}
\caption{Qualitative comparison of prompt-guided editing methods. We reuse the reference results published by \cite{liu_towards_2024}, so the examples are not cherry-picked. From left to right: source image, target prompt, our result, Free Prompt Editing, P2P \cite{hertz2022prompt}, PnP \cite{tumanyan2023plug}, SDEdit \cite{meng2021sdedit} with two noise levels, DiffEdit \cite{couairon2022diffedit}, Pix2pixzero \cite{martinezintroducing}, Shape-guided \cite{park2024shape}, MasaCtrl \cite{cao2023masactrl}, and InstructPix2Pix \cite{brooks2023instructpix2pix} (a fine-tuning-based method).}

\label{fig:content_transfer}
\end{figure*}

\section{Experiments}
\label{sec:experiments}

We evaluate our method on \textbf{image editing} and \textbf{style transfer}, providing both quantitative metrics and qualitative results. For image editing, we inject the skip connections $l=4,5$ of image A into image B, and for style transfer, we inject those of image B into image A. To evaluate our method on text-guided image-to-image and text-to-image translation, we follow established benchmarks, using the Wild-TI2I dataset \cite{tumanyan_plug-and-play_2023} and ImageNet-R-TI2I \cite{tumanyan_plug-and-play_2023}. For text-guided style transfer, we adopt the evaluation protocol outlined in \cite{jiang_artist_2024}. We ablate the effects of the timestep injection and the modulation methods in the Appendix.

Our evaluation employs three metrics. First, text-image CLIP similarity \cite{radford2021learning} quantifies how closely the generated images align with the style or edit prompts. Second, the distance between the DINO-ViT self-similarity matrices \cite{caron2021emerging} of the source and generated images assesses the degree of structure preservation, where lower values indicate better preservation. Third, LPIPS \cite{zhang2018unreasonable} measures the perceptual distance to the source image, where lower values indicate better content retention.
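
\noindent For concreteness, the metrics above can be written as follows. This is a sketch based on the standard definitions in the cited works, not a specification of our evaluation code; the symbols $E_I$, $E_T$, $f_i$, $\phi^l$, and $w_l$, as well as the scaling by 100, are assumptions introduced here for illustration. Given a source image $x$, a generated image $\hat{x}$, and a target prompt $p$, CLIP similarity is the cosine similarity between the image embedding $E_I(\hat{x})$ and the text embedding $E_T(p)$:
\[
\mathrm{CLIP}(\hat{x}, p) = 100 \cdot \frac{E_I(\hat{x})^{\top} E_T(p)}{\|E_I(\hat{x})\|_2 \, \|E_T(p)\|_2}.
\]
The structure distance compares the self-similarity matrices of the DINO-ViT token features $f_i(\cdot)$ of the two images:
\[
S(x)_{ij} = \frac{f_i(x)^{\top} f_j(x)}{\|f_i(x)\|_2 \, \|f_j(x)\|_2},
\qquad
d_{\mathrm{DINO}}(x, \hat{x}) = \| S(x) - S(\hat{x}) \|_F .
\]
LPIPS sums, over the layers $l$ of a pre-trained network, the spatially averaged squared distance between channel-normalized activations $\phi^l$ (of spatial size $H_l \times W_l$), weighted per channel by learned weights $w_l$:
\[
d_{\mathrm{LPIPS}}(x, \hat{x}) = \sum_{l} \frac{1}{H_l W_l} \sum_{h,w} \bigl\| w_l \odot \bigl( \phi^l_{hw}(x) - \phi^l_{hw}(\hat{x}) \bigr) \bigr\|_2^2 .
\]
Under these definitions, higher CLIP similarity and lower $d_{\mathrm{DINO}}$ and $d_{\mathrm{LPIPS}}$ are better.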

We implement our method with the \texttt{Diffusers} library, using a custom \texttt{UNet2DConditionModel} initialized from the pre-trained weights of \texttt{stabilityai/stable-diffusion-2-base}. For image-to-image translation, we invert the source image with the \texttt{DDIMInverseScheduler} using 50 steps and generate images with the \texttt{UniPCMultistepScheduler} using 50 inference steps and a guidance scale of 7.5.
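
\noindent As a reminder of what the guidance scale controls, the standard classifier-free guidance update can be sketched as follows. The notation ($\epsilon_\theta$ for the U-Net noise prediction, $z_t$ for the latent at timestep $t$, $c$ for the prompt embedding, $\emptyset$ for the empty or negative prompt, and $s$ for the guidance scale) is assumed here for illustration and is not defined elsewhere in this section. At each denoising step, the noise prediction is
\[
\tilde{\epsilon}_\theta(z_t, c) = \epsilon_\theta(z_t, \emptyset) + s \, \bigl( \epsilon_\theta(z_t, c) - \epsilon_\theta(z_t, \emptyset) \bigr),
\qquad s = 7.5 .
\]
Setting $s=1$ recovers the purely conditional prediction, while larger values push the sample more strongly toward the prompt.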

\input{plots/pnp}
\begin{figure*}[h]
\centering
\includegraphics[width=0.9\textwidth]{sec/figures/Figures_for_disentanglement_paper_30.png}
\caption{Qualitative evaluation against current style transfer methods. We use the reference results of \cite{jiang_artist_2024} and do not cherry-pick.}
\label{fig:style_transfer}
\end{figure*}
\begin{table*}[h]
\centering
\caption{Evaluation of Different Style Transfer Models on the Artist Dataset \cite{jiang_artist_2024}, measuring Content Preservation (LPIPS) and Stylization Prompt Alignment (CLIP Alignment).}
\label{quantitative_style_transfer}
\resizebox{\textwidth}{!}{
\begin{tabular}{lccccccccccc}
\hline
\textbf{Metric} & \textbf{Ours (l=4,5)} & \textbf{Ours (l=4)} & \textbf{Artist} & \textbf{DDIM} & \textbf{NTI-P2P} & \textbf{PnP} & \textbf{DiffStyler} & \textbf{InstructP2P} & \textbf{ControlNet-Canny} & \textbf{ControlNet-Depth} & \textbf{CLIPStyler} \\
\hline
LPIPS $\downarrow$ & 0.57 & 0.67 & 0.62 & 0.74 & 0.67 & 0.67 & 0.72 & \textbf{0.47} & 0.72 & 0.78 & 0.51 \\
CLIP Alignment $\uparrow$& 26.27 & \textbf{28.55} & 28.33 & 28.38 & 25.87 & 26.4 & 26.82 & 23.59 & 26.4 & 27.05 & 26.14 \\
\hline
\end{tabular}
}
\end{table*}

\subsection{Image editing}


%To assess our model’s performance, we conduct qualitative and quantitative analyses, focusing on identity preservation, structural fidelity, and style accuracy across various transformations.

\noindent\textbf{Qualitative analysis} Figure \ref{fig:content_transfer} provides a comparison with previous methods. Competing methods frequently exhibit issues: Free Prompt Editing lacks style specificity (e.g., ``penguin embroidery'' fails to capture the embroidery texture), Prompt-to-Prompt does not follow the prompt effectively (the horse is not in the museum), and Plug-and-Play leads to feature distortions (e.g., ``silver robot''). SDEdit struggles with structural integrity at high noise levels, while DiffEdit and MasaCtrl lose context (e.g., the ``teddy bear'' is distorted). In contrast, our model consistently delivers prompt-specific transformations with high structural fidelity, demonstrating robustness across various styles and editing demands.

\noindent\textbf{Quantitative analysis}
We report the performance of our method on the ImageNet-R-TI2I and Wild-TI2I benchmarks. In Fig.~\ref{fig:pnp_map}, we evaluate our model with CLIP cosine similarity (indicating prompt fidelity) and DINO-ViT self-similarity distance (indicating structure preservation). Across all benchmarks (Wild-TI2I, ImageNet-R-TI2I, and Generated ImageNet-R-TI2I), our model combines high CLIP similarity with a low DINO self-similarity distance, outperforming methods such as SDEdit, VQGAN-CLIP, and DiffuseIT in both text alignment and structure preservation. Notably, our approach consistently falls in the ``Better'' region of the plots, reflecting both high text fidelity and high structural integrity.

%In summary, our model demonstrates a robust capacity to translate prompts into high-quality edits with strong structural consistency, outperforming previous methods across qualitative and quantitative metrics.
\begin{comment}
\begin{table}[h!]
\centering
\resizebox{0.5\textwidth}{!}{%
\begin{tabular}{|l|c c|c c|c c|c c|}
\hline
\textbf{Method} & \multicolumn{4}{c|}{\textbf{ImageNet-R-TI2I}}  & \multicolumn{4}{c|}{\textbf{Wild}}\\ 
&  \multicolumn{2}{c|}{\textbf{fake}} & \multicolumn{2}{c|}{\textbf{real}} & \multicolumn{2}{c|}{\textbf{fake}} & \multicolumn{2}{c|}{\textbf{real}} \\ 
 & CS $\uparrow$ & CDS $\uparrow$ & CS $\uparrow$ & CDS $\uparrow$ & CS $\uparrow$ & CDS $\uparrow$ & CS $\uparrow$ & CDS $\uparrow$ \\ \hline
SDEdit (0.5) & - & - & 28.37 & 0.1415 & -& -& 27.48 & 0.122 \\
SDEdit (0.75) & - & - & 30.17 & 0.2171 & - & -& 29.79& 0.2007\\ 
Shape-Guided & - & - & 26.01 & 0.109 & - & - & 26.53 & 0.133 \\ 
DiffEdit & 26.68 & 0.0748 & 26.50& 0.0909 & 25.59 & 0.0794 & 26.33 & 0.0879 \\ 
Pix2pixzero & 27.94 & 0.2271 & 28.96 & 0.1415 & 28.19 & 0.2864 & 29.55 & 0.1462 \\ 
P2P & 28.88 & 0.3394 & 28.56 & 0.2146 & 27.85 & 0.2796 & 28.42& 0.1939\\ 
PnP & 28.83 & 0.2318 & 28.76 & 0.2073 & 28.2 & 0.2838 & 28.46 & 0.202 \\ 
MasaCtrl & 29.66 & 0.3024 & 31.40& 0.2170& 29.96 & 0.3474 & 29.33& 0.2101\\
Und. Attention& 29.79 & 0.3559 & 29.05 & 0.2271 & 27.88 & 0.3116 & 29.04 & 0.2234 \\ 
\hline
Our [2] & 30.38& 0.251& 29.76& 0.2271 & 32.13& 0.2242& \textbf{31.05}& 0.1682\\ 
Our [1,2] & 30.21& 0.251& 30.56& 0.2271 & 31.13& 0.2242& 29.61& 0.1319\\ 
\hline
\end{tabular}%
}
\caption{Comparison of different methods on the ImageNet-R-TI2I and Wild benchmarks.}
\label{tab:comparison_results}
\end{table}
\end{comment}

\subsection{Style Transfer}
\noindent\textbf{Qualitative evaluation} 
Figure \ref{fig:style_transfer} offers a comparative analysis, highlighting clear performance differences among competing models. Models like DiffStyler, CLIPStyler, and Plug-and-Play often compromise the fidelity of the original content structure, leading to blurred or distorted shapes, particularly in intricate or highly abstract styles. NTI+P2P exhibits minimal style alteration, as is evident in the ``8-bit pixel art'' transformation, where the ship closely resembles the original. Note, however, that while these models apply style to varying degrees, evaluating artistic styles can be inherently arbitrary. Styles intended as artistic movements are sometimes conflated with specific techniques, making objective assessment challenging. For instance, applying a ``Dadaism style'' prompt may yield collage techniques rather than capturing the movement's conceptual essence.

In contrast, our model achieves a balanced and coherent output across styles, preserving the content structure and the stylistic features while also adjusting the people, clothing, and objects in a historically coherent manner (see \cref{fig:style_transfer}). For example, in the ``Impressionist painting'' transformation, our model accurately replicates the brushstroke aesthetic and introduces a poppy field, typical of Impressionist painters, while maintaining the original shape and posture of the horse. Nonetheless, our method inherits certain biases from Stable Diffusion, resulting in inaccurate visual aesthetics for movements like Cubism, Futurism, and Dadaism, even though it successfully achieves stereotypical modifications.

In \cref{fig:midjourneystyles}, we show a practical application of the style transfer capabilities of our model to styles shared by the creative communities around Midjourney and StabilityAI, using either a prompt or an image as the style reference. Note that, to our knowledge, competing works have not demonstrated style transfer from a single reference image.

\noindent\textbf{Quantitative evaluation} Our method, reported as \textbf{l=4,5} and \textbf{l=4} in Table \ref{quantitative_style_transfer}, offers a trade-off between alignment with the text prompt and preservation of the content structure. On the CLIP Alignment metric, the \textbf{l=4} variant achieves the highest score (28.55), narrowly ahead of DDIM (28.38) and Artist (28.33), while \textbf{l=4,5} reaches 26.27, comparable to PnP and ControlNet-Canny (26.4). On LPIPS, the \textbf{l=4,5} variant attains 0.57, the third-lowest value after InstructP2P (0.47) and CLIPStyler (0.51), whereas \textbf{l=4} scores 0.67, level with NTI-P2P and PnP. Injecting both skip connections therefore favors content retention, while injecting only $l=4$ favors adherence to the style prompt. Competing methods such as DDIM (0.74), DiffStyler (0.72), and ControlNet-Depth (0.78) display higher LPIPS scores, reflecting a greater degree of content distortion. Artist achieves competitive scores (0.62 LPIPS, 28.33 CLIP Alignment) but yields inferior qualitative results.
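
\noindent To put the LPIPS gaps in perspective, the relative reduction obtained by \textbf{l=4,5} with respect to a baseline $b$ can be computed directly from Table \ref{quantitative_style_transfer}. This is a back-of-the-envelope calculation on the rounded table entries, not an additional experiment, and the symbol $\Delta_b$ is introduced here only for illustration:
\[
\Delta_b = \frac{\mathrm{LPIPS}_b - 0.57}{\mathrm{LPIPS}_b},
\qquad
\Delta_{\mathrm{DDIM}} = \frac{0.74 - 0.57}{0.74} \approx 23\%,
\]
\[
\Delta_{\mathrm{DiffStyler}} = \frac{0.72 - 0.57}{0.72} \approx 21\%,
\qquad
\Delta_{\mathrm{ControlNet\mbox{-}Depth}} = \frac{0.78 - 0.57}{0.78} \approx 27\% .
\]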

Finally, in \cref{fig:turbo}, we show editing results obtained with the turbo-distilled version of the model. To our knowledge, previous works have not demonstrated this applicability.

\begin{figure}[h]
\centering
\includegraphics[width=0.5\textwidth]{sec/figures/Figures_for_disentanglement_paper_23.png}
\caption{The AI Art online communities offer an incredible wealth of information on style transfer in blogs such as \href{https://stable-diffusion-art.com}{Stable Diffusion Art} that could be leveraged to build applied benchmarks for style transfer. In this figure, we show two interesting applications of our method: the first consists of the transfer of closed-source styles (e.g., styles used in Midjourney) to Stable Diffusion outputs using single-image style transfer (on the left). The second leverages the style prompts (with respective negative prompts) released by \href{https://stable-diffusion-art.com/sdxl-styles/\#Stability_AI_styles}{StabilityAI} to transfer the described styles to real images or selected generated images (on the right).}
\label{fig:midjourneystyles}
\end{figure}

\begin{figure}[h]
\centering
\includegraphics[width=0.4\textwidth]{sec/figures/Figures_for_disentanglement_paper_26.png}
\caption{Example results of text-based image editing using Stable Diffusion Turbo with one-step inference on \texttt{wild-ti2i-fake}. The edits are coherent, achieving radical changes while maintaining the original structure. Compared to multi-step inference, the control over the background is more limited.}
\label{fig:turbo}
\end{figure}

%Overall, these results underscore the ability of our \textbf{1,2-skip} and \textbf{2-skip} models to balance style adherence with structural preservation, consistently outperforming competing approaches in achieving prompt-specific transformations while retaining content integrity.
%Overall, these quantitative results emphasize the efficacy of our \textbf{1,2-skip} and \textbf{2-skip} models in achieving prompt-specific transformations with high perceptual fidelity. The robustness and adaptability demonstrated across diverse image-to-image translation tasks validate our approach as a reliable solution for applications requiring both stylistic variety and content integrity.
%\begin{figure}[h] \centering \includegraphics[width=\textwidth]{sec/figures/style_transfer_examples.png} \caption{Style transfer examples with our method.} \label{fig
%} \end{figure}
%As a response to the previous section, we propose to evaluate and test style transfer on two novel tasks: using images in a style to generate new images via prompt and using style prompts common within the AI Art communities to transfer to real/generated images. We offer two examples of this. 
\section{Conclusion}
In conclusion, this paper explores the impact of U-Net skip connections in Stable Diffusion models, presenting a training-free, efficient approach, SkipInject, that enables high-quality text-guided image editing and style transfer. By systematically examining these skip connections, we address key questions about how spatial and stylistic information is encoded in the latent spaces of Stable Diffusion, the stages within the denoising process where they arise, and the structure of these spaces. Our findings reveal that specific skip connections are fundamental in controlling content and style, providing insight into how these components influence image generation.

The proposed method leverages the $l=4$ and $l=5$ skip connections to achieve precise style and content transfer, demonstrating state-of-the-art or on-par performance across established benchmarks. In addition, we introduce three modulation techniques for controlled editing intensity, offering flexible adjustments to meet diverse requirements.

Our approach currently relies on a single latent, which limits its applicability to scenarios that require dual-image style transfer. Future work will focus on extending SkipInject to support two-image inputs for broader applications.

\section*{Acknowledgments}
Ludovica Schaerf carried out this work as a visiting student at Cambridge Digital Humanities.


\newpage
