\section{Conclusions}

In this work, we introduce MV Instruct Pix2Pix XL, an enhanced version of the Instruct Pix2Pix \cite{brooks2023instructpix2pix} diffusion model adapted for multi-view image editing. Our approach builds on the state-of-the-art SD-XL \cite{podell2023sdxl} generative model and extends it to produce consistent edits across multiple viewpoints.

We further demonstrate how MV Instruct Pix2Pix XL enables 3D editing within our \we{} pipeline, a novel framework designed for efficient multi-view object modification. Unlike conventional methods that require multiple inference passes, our approach performs a single inference pass and then applies an interpolation process that propagates the edits to all views while maintaining consistency.

To enhance the visual fidelity of the outputs, we incorporate the Swin2SR \cite{conde2022swin2sr} super-resolution model for fine-grained image refinement. For geometry estimation, we integrate a Structure-from-Motion (SfM) pipeline followed by 3D Gaussian Splatting \cite{kerbl20233d}, enabling high-quality 3D reconstruction of the edited asset.

Our method generalizes to both digital and real-world inputs and can serve as a post-hoc refinement stage for any existing 3D generative model, regardless of its underlying geometry representation.

Beyond 3D editing, our work raises broader questions about how to adapt 2D generative models to multi-view settings in a structured and scalable manner. Our novel frame interpolation technique addresses one part of this problem: it propagates modifications applied to a subset of frames to the remaining views, preserving object consistency across perspectives.

This research opens the way for future work in multi-view generative modeling that extends beyond editing to tasks such as multi-view synthesis, reconstruction, and consistency-aware generation.

% \clearpage