\begin{figure}[ht!]
    \centering
    \includegraphics[width=0.475\textwidth]{./Figures/intro/intro.png}
    \caption{High-level overview of our \we{} editing process. Given a turn-around sequence and an edit prompt as input, \we{} returns the edited and reconstructed 3D model.}
    \label{fig:mario}
\end{figure}

\begin{figure*}[ht]
    \centering
    \includegraphics[width=0.995\textwidth]{./Figures/showcase/head_full.png}
    \caption{Renderings of reconstructed 3D models after text-guided editing with our \we{}.}
    \label{fig:our_head}
\end{figure*}


\section{Introduction}
\label{sec:intro}

The field of 3D visual generative AI is undergoing significant growth, driven by advances in the quality and realism of 2D generative models, as well as breakthroughs in novel 3D reconstruction techniques, such as 3D Gaussian Splatting \cite{kerbl20233d}. The increasing demand for automation in the creation of high-quality 3D assets has further accelerated research in this domain.

The 3D generative domain comprises various tasks, commonly classified by input modality: text-to-3D \cite{poole2022dreamfusion, shi2024mvdream, wang2023prolificdreamerhighfidelitydiversetextto3d, yu2023painthumanhighfidelitytextto3dhuman, sun2023dreamcraft3dhierarchical3dgeneration, yu2023textto3dclassifierscoredistillation, chen2024textto3d, nichol2022pointe}, image-to-3D \cite{metzer2022latentnerf, zeng2024ipdreamer, deng2022nerdi, melaskyriazi2023realfusion360degreconstructionobject, tang2023makeit3dhighfidelity3dcreation, xu2023neurallift360liftinginthewild2d, tang2024dreamgaussiangenerativegaussiansplatting}, and 3D-to-3D generation, or editing \cite{haque2023instructnerf2nerfediting3dscenes, armandpour2023reimaginenegativepromptalgorithm, brooks2023instructpix2pix, parmar2023zeroshotimagetoimagetranslation, palandra2024gseditefficienttextguidedediting, fang2023gaussianeditorediting3dgaussians}. Although many approaches exist in each direction, many rely on outdated 2D generative models and 3D reconstruction techniques, which results in low-resolution 3D models. Moreover, the 3D-to-3D direction is often treated as re-texturing, which modifies only the appearance and leaves the geometry unchanged.

In this work, we introduce an advanced implicit 3D editing algorithm that operates solely on text prompts, eliminating the need for explicit manual masks or bounding boxes to specify the target region for editing. Our approach presents a novel 3D editing pipeline that integrates sequential components of 3D reconstruction using 3D Gaussian Splatting \cite{kerbl20233d} with a multi-view editing framework. This framework leverages the Stable Diffusion XL \cite{podell2023sdxl} generative model and the 2D editing capabilities of Instruct Pix2Pix \cite{brooks2023instructpix2pix}, allowing high-fidelity text-guided modifications of 3D assets.

In summary, our contributions are:
\begin{itemize}
    \item We propose a novel Multi-View Instruct Pix2Pix XL model, a modified version of the original 2D editing model Instruct Pix2Pix \cite{brooks2023instructpix2pix} that generates consistent multi-view frames of the same object using the state-of-the-art Stable Diffusion XL \cite{podell2023sdxl} generative model.
    
    \item We propose our 3D editing model \we{}, which combines our pre-trained Multi-View Instruct Pix2Pix XL model with 3D Gaussian Splatting \cite{kerbl20233d} geometry reconstruction. Because the multi-view model is coupled with the complex interpolation logic, it is applied in a single-inference manner.

    \item We demonstrate that \we{} generalizes to various real-life and digital inputs and can serve as a post hoc stage for an arbitrary 3D generative model, regardless of its geometry representation technique.
\end{itemize}
