\section{Approach}






Generating consistent content across many images has proven to be a challenging task \cite{shi2024mvdream}. Unlike GSEdit \cite{palandra2024gseditefficienttextguidedediting}, which applies Instruct Pix2Pix edits iteratively and optimizes geometry with an SDS loss, we edit the frames in a single inference pass of our modified Instruct Pix2Pix model, which makes 2D editing and 3D reconstruction independent processes.

The main idea of our approach is to apply the edits described in the input prompt in a single inference pass of the generative model, which keeps them consistent. To this end, we split the input frames into four orthogonal key frames and the remaining intermediate (inter) frames. The key frames capture the most information about the object from different angles, and only they are edited by the diffusion model, using our proposed Multi-View variant of the Instruct Pix2Pix model \cite{brooks2023instructpix2pix} (see \cref{sec:mv-instruct}). The remaining frames are edited by interpolating the edited key frames into the poses of the original inter frames (see \cref{sec:interp}), so that their edits stay consistent with the key frames. The full pipeline is shown in \cref{fig:full}. This idea is strongly inspired by 3D-GSR \cite{bondarets2024dgsr}, which achieves consistent 3D Super-Resolution by combining 2D Super-Resolution with 3D GS models.
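As an illustration of the key/inter split, the following minimal sketch picks the four frames whose camera azimuths are closest to four orthogonal directions. It assumes that a per-frame azimuth (in degrees) is available; the selection rule and the names \texttt{circular\_distance} and \texttt{split\_key\_and\_inter} are hypothetical and are not taken from our implementation.

```python
def circular_distance(a: float, b: float) -> float:
    """Smallest absolute difference between two angles, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def split_key_and_inter(azimuths, start=0.0):
    """Return (key_ids, inter_ids) for a list of per-frame azimuths.

    Assumption: key frames are the frames nearest to the four directions
    start, start + 90, start + 180 and start + 270 degrees.
    """
    targets = [(start + 90.0 * k) % 360.0 for k in range(4)]
    key_ids = [
        min(range(len(azimuths)),
            key=lambda i: circular_distance(azimuths[i], t))
        for t in targets
    ]
    # Every frame that is not a key frame is an intermediate frame.
    inter_ids = [i for i in range(len(azimuths)) if i not in key_ids]
    return key_ids, inter_ids


# Example: a turntable sequence with one frame every 10 degrees.
azimuths = [10.0 * i for i in range(36)]
key_ids, inter_ids = split_key_and_inter(azimuths)
```

With this rule, only the four key frames are passed to the diffusion model, and all other frames are left for the interpolation stage.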

\begin{figure}[hb!]
    \centering
    \includegraphics[width=0.4\textwidth]{./Figures/model_sheet.png}
    \caption{A model sheet is a single image composed as a grid of the orthogonal frames of the same object.}
    \label{fig:model_sheet}
\end{figure}


\subsection{Model sheet}
The idea was first introduced in MVDream \cite{shi2024mvdream}, whose authors trained a multi-view diffusion model to generate four orthogonal views as a grid. Similarly, we compose a grid of orthogonal frames, which we call a model sheet (\cref{fig:model_sheet}), because Stable Diffusion (SD) produces consistent content within a single inference pass.
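A model sheet can be sketched as a simple array operation. The snippet below is a minimal illustration that assumes four equally sized views arranged in a $2 \times 2$ layout; the layout and the function names are assumptions made for this example only.

```python
import numpy as np


def compose_model_sheet(views):
    """Tile four H x W x C views into one 2H x 2W x C model sheet."""
    if len(views) != 4:
        raise ValueError("a model sheet is built from exactly four views")
    top = np.concatenate([views[0], views[1]], axis=1)
    bottom = np.concatenate([views[2], views[3]], axis=1)
    return np.concatenate([top, bottom], axis=0)


def split_model_sheet(sheet):
    """Cut a 2 x 2 model sheet back into its four views."""
    h, w = sheet.shape[0] // 2, sheet.shape[1] // 2
    return [sheet[:h, :w], sheet[:h, w:], sheet[h:, :w], sheet[h:, w:]]


# Four dummy views, each filled with a different constant value.
views = [np.full((4, 4, 3), k, dtype=np.uint8) for k in range(4)]
sheet = compose_model_sheet(views)
restored = split_model_sheet(sheet)
```

The inverse operation is what allows the edited sheet to be cut back into individual key frames after a single inference pass.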

\begin{figure*}[ht!]
    \centering
    \includegraphics[width=0.995\textwidth]{./Figures/new_pipe_hor.png}
    \caption{The complete architecture of \we{}. Our proposed MV Instruct Pix2Pix XL edits several key frames from the input sequence under prompt guidance, and an interpolation algorithm propagates these edits to all intermediate frames so that they remain consistent with the edited key frames. Once all frames are edited, the images are upscaled with a Super-Resolution model and passed to the 3D GS reconstruction together with the SfM point cloud created from the original sequence.}
    \label{fig:full}
\end{figure*}


% \begin{figure}[hb!]
%     \centering
%     \begin{subfigure}[c]{0.125\textwidth}
%         \centering
%         \includegraphics[width=\textwidth]{./Figures/pre_sr.png}
%         \caption{Before}
%     \end{subfigure}
%     \begin{subfigure}[c]{0.125\textwidth}
%         \centering
%         \includegraphics[width=\textwidth]{./Figures/post_sr.png}
%         \caption{After}
%     \end{subfigure}
    
%     \caption{Comparison of an already edited image quality before (a) and after (b) Super-Resolution refinement step.}
%     \label{fig:sr}
% \end{figure}

\subsection{Multi-View Instruct Pix2Pix XL} \label{sec:mv-instruct}


Instruct Pix2Pix \cite{brooks2023instructpix2pix} is a high-fidelity, text-guided image editing model built upon the original Stable Diffusion v1 framework \cite{rombach2021highresolution}, capable of generating images at a resolution of $512 \times 512$.

To improve both the quality and the resolution of the 2D outputs, we adapt the training methodology of Instruct Pix2Pix to a more advanced model, Stable Diffusion XL (SDXL) \cite{podell2023sdxl}, which is approximately three times larger. Specifically, we follow the modified training script provided in the Hugging Face \texttt{diffusers} library (\url{https://github.com/huggingface/diffusers/blob/main/examples/instruct_pix2pix/train_instruct_pix2pix_sdxl.py}) and fine-tune the SDXL-base-1.0 checkpoint as our foundation.

Ensuring consistency in prompt-guided edits across multiple frames of a given sequence presents a key challenge. To address this, we employ a model sheet approach, wherein multiple key frames are aggregated into a single composite image (sheet). Edits are then applied in a single inference pass, preserving temporal and structural coherence across frames. To facilitate this, we modify the data generation process for Instruct Pix2Pix while maintaining the original training pipeline, introducing Multi-View Instruct Pix2Pix XL (MV Instruct Pix2Pix XL) for multi-view editing.

For our dataset construction, we use Objaverse 1.0 \cite{deitke2022objaverseuniverseannotated3d}, a large-scale collection of over 800K 3D models. To limit the computational cost, given the expense of SDXL fine-tuning, we filter the dataset to include only high-definition (HD) models, reducing it to approximately 50K assets. For each selected 3D model, we render four orthogonal viewpoints starting from a randomly chosen initial camera position. These renderings are composed into a model sheet, and each 3D asset yields 10 different sheets from different starting positions, for a total of 40 individual renders per object. Each model sheet is paired with a unique editing prompt, which produces roughly 500K training samples.
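The render plan described above can be sketched in a few lines. The snippet assumes that a viewpoint is described by its azimuth only and that the random start is drawn uniformly; both are simplifications for illustration, and \texttt{sheet\_azimuths} and \texttt{plan\_renders} are hypothetical names.

```python
import random


def sheet_azimuths(rng):
    """Four orthogonal azimuths (degrees) from a random starting angle."""
    start = rng.uniform(0.0, 360.0)
    return [(start + 90.0 * k) % 360.0 for k in range(4)]


def plan_renders(num_assets, sheets_per_asset=10, seed=0):
    """Map each asset index to the azimuths of its model sheets."""
    rng = random.Random(seed)
    return {
        asset: [sheet_azimuths(rng) for _ in range(sheets_per_asset)]
        for asset in range(num_assets)
    }


plan = plan_renders(num_assets=3)
renders_per_asset = sum(len(sheet) for sheet in plan[0])  # 10 sheets x 4 views
total_samples = 50_000 * 10  # about 50K assets x 10 sheets each
```

The counts match the figures in the text: 40 renders per object, and about 500K model sheets when the plan is applied to roughly 50K assets.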

Following the methodology outlined in the original Instruct Pix2Pix paper, we leverage the GPT-3 model \cite{brown2020languagemodelsfewshotlearners} to generate triplets of text prompts. Each triplet consists of (1) an image caption, (2) an edit instruction, and (3) a caption describing the modified image. These structured prompts are then used in conjunction with Stable Diffusion and Prompt-to-Prompt \cite{hertz2022prompttopromptimageeditingcross} to generate the corresponding image edits.

More specifically, for the input model sheet $m$ composed of four key frames, the forward diffusion process adds noise to the encoded latent $z = \mathcal{E}(m)$ and produces the noisy latent $z_t$. We train a network $\epsilon_\theta$ to predict the noise added to $z_t$, conditioned on the model sheet $c_M$ edited by Prompt-to-Prompt \cite{hertz2022prompttopromptimageeditingcross} and on the text prompt $c_T$, by minimizing the conditioned latent diffusion loss:
\begin{equation} \label{eq:our_loss}
    L = \mathbb{E}_{\mathcal{E}(m), \mathcal{E}(c_M), c_T, \epsilon, t} \Bigl[ \lVert \epsilon - \epsilon_\theta(z_t, t, \mathcal{E}(c_M), c_T) \rVert_2^2 \Bigr].
\end{equation}
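A single Monte Carlo sample of the loss in \cref{eq:our_loss} can be sketched as follows. This is an illustration only: the linear schedule for $\bar{\alpha}_t$, the tensor shapes, and the placeholder network \texttt{toy\_eps\_model} are assumptions, and the real $\epsilon_\theta$ is the SDXL UNet rather than the stand-in used here.

```python
import numpy as np


def add_noise(z, eps, alpha_bar_t):
    """Forward diffusion: z_t = sqrt(a) * z + sqrt(1 - a) * eps."""
    return np.sqrt(alpha_bar_t) * z + np.sqrt(1.0 - alpha_bar_t) * eps


def diffusion_loss(eps_model, z, cond_latent, text_emb, alpha_bar, rng):
    """One-sample estimate of || eps - eps_theta(z_t, t, E(c_M), c_T) ||_2^2."""
    t = int(rng.integers(0, len(alpha_bar)))   # random timestep
    eps = rng.standard_normal(z.shape)         # Gaussian noise target
    z_t = add_noise(z, eps, alpha_bar[t])
    eps_pred = eps_model(z_t, t, cond_latent, text_emb)
    return float(np.sum((eps - eps_pred) ** 2))


def toy_eps_model(z_t, t, cond_latent, text_emb):
    """Placeholder network (assumption): always predicts zero noise."""
    return np.zeros_like(z_t)


rng = np.random.default_rng(0)
alpha_bar = np.linspace(0.999, 0.01, 1000)   # toy noise schedule (assumption)
z = rng.standard_normal((4, 8, 8))           # stands in for E(m)
cond = rng.standard_normal((4, 8, 8))        # stands in for E(c_M)
text = rng.standard_normal((77, 16))         # stands in for c_T
loss = diffusion_loss(toy_eps_model, z, cond, text, alpha_bar, rng)
```

In training, this quantity would be averaged over a batch and minimized with respect to the parameters $\theta$.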

While the generation process ensures consistency within a single model sheet, it does not guarantee coherence across multiple sheets when edited in separate inference passes. If two model sheets were edited independently, the resulting modifications could diverge, leading to inconsistencies across frames. To maintain uniformity across all original input frames, we propose interpolating the edited outputs of a single model sheet rather than generating new sheets for each edit. This approach prevents discrepancies between successive generations and ensures a more temporally and structurally consistent editing process.

\begin{figure*}[hb!]
    \centering
    \textit{``Make astronaut ride a huge cat''}
    
    \begin{subfigure}[b]{0.27\textwidth}
        \centering
        \includegraphics[width=\textwidth]{./Figures/mv/orig.png}
        \caption{Multi-view input}
    \end{subfigure}
    % \hfill
    \begin{subfigure}[b]{0.225\textwidth}
        \centering
        \includegraphics[width=\textwidth]{./Figures/mv/canvas.png}
        \caption{Original Instruct Pix2Pix \cite{brooks2023instructpix2pix}}
    \end{subfigure}
    % \hfill
    \begin{subfigure}[b]{0.27\textwidth}
        \centering
        \includegraphics[width=\textwidth]{./Figures/mv/our_instruct.png}
        \caption{Multi-View Instruct Pix2Pix XL (ours)}
    \end{subfigure}
    
    \caption{2D editing comparison on a simple model sheet of two images (a), edited with the original Instruct Pix2Pix (b) and with our MV model (c).}
    \label{fig:instructs}
\end{figure*}

\subsection{Interpolation} \label{sec:interp}

Since MV Instruct Pix2Pix XL generates edits for only a subset of frames from the model sheet, it is necessary to propagate these modifications to the remaining input images. To achieve this, we interpolate the content of the edited model sheet frames into the corresponding unmodified poses, effectively transferring the new edits to the entire sequence. The interpolation process between two key frames is illustrated in \cref{fig:interp}.

\begin{figure}[htb!]
    \centering
    \includegraphics[width=0.475\textwidth]{./Figures/interpolation.png}
    \caption{Interpolation of a single batch of intermediate frames between two key frames. First, we use classical interpolation to propagate only the binary mask of the region to be edited, with some margin. Then, the Ezsynth \cite{trenton2023ezsynth} model generates the edited images given the updated content from the key frames.}
    \label{fig:interp}
\end{figure}
For this interpolation, we employ the Ezsynth video stylization model \cite{trenton2023ezsynth}, which is based on the Recurrent All-Pairs Field Transforms (RAFT) model for optical flow \cite{teed2020raftrecurrentallpairsfield}. In our pipeline, we treat Ezsynth as a black-box component and tune its hyperparameters to reduce its reliance on edge detection, which allows for more significant geometric transformations of the inputs.

To ensure high-fidelity interpolation, we constrain the model to modify only the regions that were edited by MV Instruct Pix2Pix XL. For each input frame, we generate a binary mask highlighting the areas requiring interpolation. This is accomplished by first identifying the modified regions in the model sheet frames by computing the difference between the original and edited images and thresholding the result to create four binary masks. To refine these masks, we apply morphological opening followed by closing operations.
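The mask extraction described above can be sketched with plain array operations. The threshold value, the $3 \times 3$ structuring element, and the border handling below are assumptions chosen for illustration; \texttt{erode}, \texttt{dilate}, and \texttt{edit\_mask} are hypothetical helper names.

```python
import numpy as np


def _neighborhood(mask, pad_value):
    """Stack the nine shifted copies of mask seen by a 3x3 window."""
    padded = np.pad(mask, 1, mode="constant", constant_values=pad_value)
    h, w = mask.shape
    return np.stack([padded[dy:dy + h, dx:dx + w]
                     for dy in range(3) for dx in range(3)])


def erode(mask):
    """Binary erosion; pixels outside the image count as foreground."""
    return _neighborhood(mask, True).all(axis=0)


def dilate(mask):
    """Binary dilation; pixels outside the image count as background."""
    return _neighborhood(mask, False).any(axis=0)


def edit_mask(original, edited, threshold=0.1):
    """Binary mask of the edited region for H x W x C images in [0, 1]."""
    diff = np.abs(edited.astype(np.float64) - original.astype(np.float64))
    mask = diff.max(axis=-1) > threshold   # threshold the per-pixel change
    opened = dilate(erode(mask))           # opening removes isolated specks
    return erode(dilate(opened))           # closing bridges narrow gaps


original = np.zeros((16, 16, 3))
edited = original.copy()
edited[4:10, 2:7] = 1.0    # first edited block
edited[4:10, 8:13] = 1.0   # second block, one pixel away from the first
edited[14, 14] = 1.0       # isolated noisy pixel
mask = edit_mask(original, edited)
```

In this toy example the opening discards the isolated pixel, and the closing merges the two neighboring blocks into a single region.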

Additionally, since our pipeline incorporates a Structure from Motion (SfM) step for geometry estimation (see \cref{sec:sfm}), we leverage the matched keypoints and estimated homography between sequential frames. This allows us to warp and transform the masks derived from edited model sheets, generating corresponding masks for the remaining frames. These refined masks are then used as input to the interpolation model, enabling it to synthesize realistic and spatially consistent modifications. This approach produces more coherent and visually accurate interpolated frames than directly transforming the edited regions.

More specifically, given the key frames $k_1' = \epsilon_\theta(k_1)$ and $k_2' = \epsilon_\theta(k_2)$ edited with the MV Instruct Pix2Pix XL model $\epsilon_\theta$, we describe the interpolation process as inference with the learned model $\epsilon_\psi$, which turns the original intermediate frame $i$ into the edited frame $i'$:
\begin{equation} \label{eq:inter-frame}
    i' = \epsilon_\psi(i, m_i, k_1', k_1 - k_1', k_2', k_2 - k_2'),
\end{equation}
where the mask $m_i$ of the intermediate frame $i$ is obtained by warping the binary mask $m_k$ of a key frame with the homography matrix $H$ estimated between the key and inter frames: $m_i = H^{-1} m_k$.
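The mask warping step can be sketched as a nearest-neighbor lookup through the inverse homography. The direction convention is an assumption made for this example: here $H$ maps key-frame pixel coordinates $(x, y, 1)$ to intermediate-frame coordinates, so every output pixel is looked up in the key-frame mask through $H^{-1}$. The function name \texttt{warp\_mask} is hypothetical.

```python
import numpy as np


def warp_mask(key_mask, H, out_shape):
    """Warp a binary key-frame mask into an intermediate frame.

    H is assumed to map key-frame pixels to intermediate-frame pixels.
    """
    h, w = out_shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Homogeneous coordinates of every output pixel, shape 3 x (h * w).
    dst = np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)])
    src = np.linalg.inv(H) @ dst
    src_x = np.rint(src[0] / src[2]).astype(int)
    src_y = np.rint(src[1] / src[2]).astype(int)
    kh, kw = key_mask.shape
    # Keep only the lookups that fall inside the key-frame mask.
    valid = (src_x >= 0) & (src_x < kw) & (src_y >= 0) & (src_y < kh)
    out = np.zeros(h * w, dtype=bool)
    out[valid] = key_mask[src_y[valid], src_x[valid]]
    return out.reshape(h, w)


key_mask = np.zeros((8, 8), dtype=bool)
key_mask[1, 1] = True
# Pure translation: 2 pixels along x and 3 pixels along y.
H = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 3.0],
              [0.0, 0.0, 1.0]])
inter_mask = warp_mask(key_mask, H, (8, 8))
```

For the translation above, the single foreground pixel at row 1, column 1 moves to row 4, column 3 of the intermediate-frame mask.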

The key contribution of this component is that it ensures consistency between intermediate frames and the already edited key frames generated by MV Instruct Pix2Pix XL. As a result, the final output maintains a unified and seamless sequence, preserving both geometric and visual coherence throughout the edited frames.

\begin{figure*}[ht!]
    \centering
    \includegraphics[width=0.95\textwidth]{./Figures/showcase/lion.png}
    \caption{Results of edits with \we{}. The input sequence of a lion is edited into new 3D models according to the input prompts. Our model can produce both complex geometry changes (bottom row, structure changes) and appearance changes (top row, color changes).}
    \label{fig:our_lion}
\end{figure*}


\subsection{Super-Resolution} \label{sec:sr}

To further enhance the visual fidelity and detail of the images, we integrate an additional Super-Resolution upscaling model, processing frames in batches of four. For this purpose, we employ the transformer-based Swin2SR model \cite{conde2022swin2sr}.
% A qualitative comparison of images before and after applying the Super-Resolution model is presented in \cref{fig:sr}.


\subsection{Structure from Motion} \label{sec:sfm}

Since 3D Gaussian Splatting (3D GS) requires an initial sparse point cloud, and our interpolation pipeline (see \cref{sec:interp}) relies on estimated homographies between frames, we utilize a classical Structure from Motion (SfM) approach from the COLMAP library \cite{schoenberger2016sfm}.

The SfM reconstruction process is applied exclusively to the original, non-edited images. Running SfM on the edited images often leads to failures due to minor inconsistencies introduced during the editing process, as the traditional COLMAP pipeline lacks robustness against such artifacts. However, performing SfM on the original images consistently succeeds, providing a reliable foundation for subsequent 3D GS optimization with an initial coarse set of features.
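For reference, a sparse reconstruction of the original images can be obtained with the standard COLMAP command-line stages. The sketch below only builds the command lines; the directory layout, the choice of exhaustive matching, and the helper name \texttt{colmap\_sfm\_commands} are assumptions and do not necessarily match the exact configuration used in our experiments.

```python
from pathlib import Path


def colmap_sfm_commands(image_dir, workspace):
    """Build the COLMAP commands for a basic sparse reconstruction."""
    workspace = Path(workspace)
    database = str(workspace / "database.db")
    sparse = str(workspace / "sparse")   # this directory must already exist
    return [
        # 1. Detect and describe keypoints in every original image.
        ["colmap", "feature_extractor",
         "--database_path", database, "--image_path", str(image_dir)],
        # 2. Match keypoints between all image pairs.
        ["colmap", "exhaustive_matcher", "--database_path", database],
        # 3. Incremental mapping: camera poses and a sparse point cloud.
        ["colmap", "mapper",
         "--database_path", database, "--image_path", str(image_dir),
         "--output_path", sparse],
    ]


commands = colmap_sfm_commands("images", "workspace")
```

Each command can then be executed in order, for example with \texttt{subprocess.run(cmd, check=True)}, on the original (non-edited) frames.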

\subsection{3D Gaussian Splatting}

For the final 3D reconstruction, we employ the original implementation of 3D Gaussian Splatting \cite{kerbl20233d} as a black-box model. The optimization process refines the initial Gaussians obtained from the sparse point cloud to align with the updated, edited appearance. Additionally, we disable the model's ability to represent view-dependent colors via spherical harmonics (SH) coefficient optimization to maintain consistency in color representation.

% For training the model, we use the train-test split strategy of Mip-NeRF 360  \cite{barron2022mipnerf}, where 7/8 of the input images are assigned to the training split, and the remaining 1/8 are for testing. We calculate the SSIM, PSNR, and LPIPS metrics on the test partition to compare the quality of renderings of two models trained on original and edited images, respectively. The aim is to find out whether the edited images are consistent enough to optimize the 3D model of the same quality as the model trained on only originals.

