\section{Preliminaries}

\subsection{InstructPix2Pix}

InstructPix2Pix \cite{brooks2023instructpix2pix} is a state-of-the-art (SOTA) 2D diffusion model for text-guided image editing. Its authors train it on a synthetically generated dataset, built with Prompt-to-Prompt \cite{hertz2022prompttopromptimageeditingcross} and the GPT-3 language model \cite{brown2020languagemodelsfewshotlearners}, in which each example pairs an input image and a text editing instruction with the corresponding edited image.

The dataset generation process consists of two stages. First, a fine-tuned GPT-3 takes the caption of the original image and produces an editing instruction together with an ``edited prompt'', a modified caption that describes the image after the intended transformation. Second, Stable Diffusion \cite{rombach2021highresolution} and Prompt-to-Prompt \cite{hertz2022prompttopromptimageeditingcross} generate the original and edited images from the original and edited captions, respectively.

The model is trained by minimizing the latent diffusion objective, conditioned on the text instruction $c_T$ and the input image $c_I$:
\begin{equation} \label{eq:prel_loss}
    L = \mathbb{E}_{\mathcal{E}(x), \mathcal{E}(c_I), c_T, \epsilon, t} \Bigl[ \| \epsilon - \epsilon_\theta(z_t, t, \mathcal{E}(c_I), c_T) \|_2^2 \Bigr],
\end{equation}
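where $\mathcal{E}$ is the image encoder, $z_t$ is the noisy latent obtained by diffusing the latent $z=\mathcal{E}(x)$ of the image $x$ for $t$ steps, $\epsilon$ is the noise added during this process, and $\epsilon_\theta$ is the network trained to predict it.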
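
As an illustration of how $z_t$ is formed, assume the standard variance-preserving forward process commonly used for latent diffusion models, with a cumulative noise schedule $\bar{\alpha}_t \in (0, 1]$; this symbol and the Gaussian form below are assumptions not stated elsewhere in this section. Under that assumption, the noisy latent can be sampled in closed form:
\[
    z_t = \sqrt{\bar{\alpha}_t}\, z + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,
    \qquad \epsilon \sim \mathcal{N}(0, \mathbf{I}), \quad z = \mathcal{E}(x).
\]
In this reading, the $\epsilon$ in Eq.~(\ref{eq:prel_loss}) is the noise sample used to form $z_t$, and $\epsilon_\theta$ is trained to recover it from $z_t$, the step $t$, and the two conditioning signals $\mathcal{E}(c_I)$ and $c_T$.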
