





\section{Experiments}

\begin{figure*}[ht!]
    \centering
    \includegraphics[width=0.99\textwidth]{./Figures/related_bear.png}
    \caption{Comparison of our \we{} with other open-source models: GaussianEditor \cite{fang2023gaussianeditorediting3dgaussians} and GaussCtrl \cite{wu2024gaussctrlmultiviewconsistenttextdriven}. Our model performs more robust color changes but focuses only on object-level reconstruction, whereas the other models perform scene-level edits.}
    \label{fig:works_bear}
\end{figure*}




\subsection{Instruct Pix2Pix vs Ours}

We evaluate the performance of our MV Instruct Pix2Pix XL model against the original Instruct Pix2Pix in a 2D editing task using a model sheet. A visual comparison is presented in \cref{fig:instructs}, demonstrating that our proposed approach achieves greater consistency across different frames. In contrast, the original Instruct Pix2Pix struggles to maintain high-quality and consistent modifications, even in a simple case of two sequential images.


\begin{table}[hb]
    \small
    \centering
    \renewcommand{\arraystretch}{1.25}
    \begin{tabular}{|c|c|c|}
        \hline
        \normalsize{Instruct Pix2Pix model} & \normalsize{$\text{CLIP}_1 \uparrow$} & \normalsize{$\text{CLIP}_2 \uparrow$} \\
        \hline\hline
        Ours (model sheet) & 0.86 & \textbf{0.14} \tabularnewline \hline
        Original (model sheet) & 0.79 & 0.09 \tabularnewline \hline
        Original (frame-wise) & \textbf{0.91} & 0.13 \tabularnewline \hline
        \hline
    \end{tabular}
    
    \begin{tablenotes}
        \item $\text{CLIP}_1$ : Input image similarity
        \item $\text{CLIP}_2$ : Edited text - edited image similarity
    \end{tablenotes}
    
    \caption{Comparison of CLIP text and image alignment scores on edited model sheets for our MV Instruct Pix2Pix XL and the original model \cite{brooks2023instructpix2pix}.}
    \label{table:instructs}
\end{table}

To quantitatively assess the differences, we compute average CLIP scores on our dataset and report the results in \cref{table:instructs}. The evaluation considers two key metrics: (1) Similarity to the input image, ensuring that the edited image remains structurally consistent with the original, and (2) Text-image alignment, measuring the adherence of the edited image to the input textual prompt.
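
For concreteness, the two scores can be written as follows. This is a sketch that assumes the common cosine-similarity formulation of CLIP scores; the notation is introduced here only for illustration. Let $E_I$ and $E_T$ denote the CLIP image and text encoders, $x_i$ an input image, $\hat{x}_i$ its edited counterpart, $t_i$ the text describing the edited image, and $N$ the number of evaluated samples:
\begin{align}
    \text{CLIP}_1 &= \frac{1}{N} \sum_{i=1}^{N} \frac{E_I(x_i) \cdot E_I(\hat{x}_i)}{\lVert E_I(x_i) \rVert \, \lVert E_I(\hat{x}_i) \rVert}, \\
    \text{CLIP}_2 &= \frac{1}{N} \sum_{i=1}^{N} \frac{E_T(t_i) \cdot E_I(\hat{x}_i)}{\lVert E_T(t_i) \rVert \, \lVert E_I(\hat{x}_i) \rVert}.
\end{align}
Under this assumption, image-image similarities ($\text{CLIP}_1$) typically lie much closer to 1 than text-image similarities ($\text{CLIP}_2$), so the two columns of \cref{table:instructs} are not directly comparable in magnitude.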

For Instruct Pix2Pix, we test two scenarios: processing a model sheet (as in our method) and editing individual frames sequentially (which aligns more closely with the original model's training distribution). Our method achieves the highest text-image alignment in both comparisons (0.14 vs.\ 0.09 and 0.13), indicating that its edits adhere more closely to the prompt. Its similarity to the input images (0.86) is higher than that of the original model on model sheets (0.79) but lower than that of frame-wise editing (0.91), suggesting that it introduces more substantial modifications than the original Instruct Pix2Pix applied frame by frame.

% \subsection{Impact on 3D GS Optimization}

% For training the 3D GS model, we use the train-test split strategy of Mip-NeRF 360  \cite{barron2022mipnerf}, where 7/8 of the input images are assigned to the training split, and the remaining 1/8 are for testing. We calculate the SSIM, PSNR, and LPIPS metrics on the test partition to compare the quality of renderings of two models trained on original and edited images, respectively. The aim is to determine whether the edited images are consistent enough to optimize the 3D model of the same quality as the model trained on only originals.

% \begin{table}[h]
%     \small
%     \centering
%     \subfloat[0.475\textwidth][]{
%         \centering
%         \renewcommand{\arraystretch}{1.25}
%         \begin{tabular}{| c | c | c | c |}
%             \hline
%             \normalsize{Iter} & \normalsize{SSIM $\uparrow$} & \normalsize{PSNR $\uparrow$} & \normalsize{LPIPS $\downarrow$} \\
%             \hline\hline
%             7 000 & 0.979 & 34.152 & 0.04 \\
%             \hline
%             10 000 & 0.981 & 34.614 & 0.038 \\
%             \hline
%             12 000 & 0.981 & 34.809 & 0.036 \\
%             \hline
%             15 000 & 0.981 & 35.044 & 0.035 \\
%             \hline
%         \end{tabular}
%     }
%     \hfill
%     \centering
%     \subfloat[0.475\textwidth][]{
%         \centering
%         \renewcommand{\arraystretch}{1.25}
%         \begin{tabular}{| c | c | c | c |}
%             \hline
%             \normalsize{Iter} & \normalsize{SSIM $\uparrow$} & \normalsize{PSNR $\uparrow$} & \normalsize{LPIPS $\downarrow$} \\
%             \hline\hline
%             7 000 & 0.975 & 32.255 & 0.06 \\
%             \hline
%             10 000 & 0.978 & 33.798 & 0.055 \\
%             \hline
%             12 000 & 0.979 & 33.998 & 0.053 \\
%             \hline
%             15 000 & 0.98 & 34.25 & 0.051 \\
%             \hline
%         \end{tabular}
%     }
%     \caption{Metrics of 3D GS geometry reconstruction on only original (a) and only edited (b) images. They are calculated between input images for reconstruction and the rendered views of the estimated model from the same camera positions.}
%     \label{tab:gs_fit}
% \end{table}

% More precisely, we conduct the experiment with the turning lion statue into a cat from \cref{fig:our_lion}. We performed two separate geometry optimization cases with 3D Gaussian Splatting: the first one on only original images and the second one on only edited ones. We calculated metrics for several crucial intermediate iterations and provided them in \cref{tab:gs_fit}. From the scores, we find that the geometry estimation performs slightly poorer on the edited frames than on the original ones, meaning that there are some degree of inconsistencies present in the editing process. However, the difference is not crucial and is rather insignificant. We conducted the following experiment on ten different examples, and all of them had the same trend in metrics.


\subsection{Results}

We present a diverse set of edited 3D assets generated with \we{} in \cref{fig:mario,fig:our_head,fig:our_lion,fig:our_irl,fig:instructs}. Our approach demonstrates the ability to produce high-quality 3D edits that accurately reflect the desired modifications specified by the text prompt.

We compare \we{} with other open-source 3D editing methods, GaussianEditor \cite{fang2023gaussianeditorediting3dgaussians} and GaussCtrl \cite{wu2024gaussctrlmultiviewconsistenttextdriven}, in \cref{fig:works_bear}. While our solution provides more robust edits, it performs strictly object-level edits, whereas the other models can also handle scene-level edits.

Our method is effective for both geometric transformations (e.g., converting an object into a different form) and appearance modifications (e.g., style or color adjustments). The generated assets exhibit high fidelity and successfully capture complex geometric structures while maintaining consistency across multiple views.

\begin{figure*}[ht!]
    \centering
    \includegraphics[width=0.95\textwidth]{./Figures/showcase/irl.png}
    \caption{Our \we{} generalizes to both digital (left column) and real-life (right column) inputs.}
    \label{fig:our_irl}
\end{figure*}

\subsection{Real-Life Inputs}

To further evaluate our model's performance, we apply \we{} to turntable-style images captured from real-world objects, as shown in \cref{fig:our_irl}. However, the results indicate a performance degradation compared to digital input data.

Upon analysis of the intermediate outputs, we identify the primary cause as suboptimal key frame selection, resulting from instability in video recordings, abrupt camera movements, or sudden shifts in the object's position. To enhance robustness against such real-world artifacts, we propose integrating a more advanced key frame selection algorithm and applying frame deblurring techniques in future work to improve editing quality.

\begin{figure}[hbt!]
    \centering
    \includegraphics[width=0.325\textwidth]{./Figures/failed/failed.png}
    \caption{Failure cases. \we{} sometimes struggles to isolate the correct edit region (middle and bottom rows) or to handle thin edges and semi-transparent objects (top row).}
    \label{fig:failed}
\end{figure}

\subsection{Failure Cases}

Despite its improvements, our approach inherits certain failure cases from the original Instruct Pix2Pix \cite{brooks2023instructpix2pix}, as illustrated in \cref{fig:failed}. In some instances, the model fails to isolate the specified object components accurately. For example, in the ice cream scenario, the model erroneously modifies the green jam instead of the intended waffle. Similarly, in the goblet case, it struggles to preserve the fine structure of the input skeleton, resulting in inconsistencies in bone articulation and subsequent reconstruction artifacts.

Furthermore, our 2D editing and interpolation pipeline occasionally misidentifies elements of an object. For example, the top of a glass is misclassified, leading to a non-transparent blue coloration. Additionally, the model fails to correctly interpret the internal composition of a goblet, mistakenly placing a cherry at its center rather than modifying its interior as intended.
