Abstract: The modeling and manipulation of 3D scenes captured from the real world are pivotal in various applications, attracting growing research interest. While previous editing methods have achieved promising results by manipulating 3D meshes, they often require accurately reconstructed meshes to perform editing, which limits their application in 3D content generation. To address this gap, we introduce a novel single-image-driven 3D scene editing approach based on 3D Gaussian Splatting, enabling intuitive manipulation by directly editing content on a 2D image plane. Our method learns to optimize the 3D Gaussians to align with an edited version of the image rendered from a user-specified viewpoint of the original scene. To capture long-range object deformation, we introduce a positional loss into the optimization of 3D Gaussian Splatting and enable gradient propagation through reparameterization. To handle 3D Gaussians occluded when rendering from the specified viewpoint, we build an anchor-based structure and employ a coarse-to-fine optimization strategy that accommodates long-range deformation while maintaining structural stability. Furthermore, we design a novel masking strategy that adaptively identifies non-rigid deformation regions for fine-scale modeling. Extensive experiments demonstrate the effectiveness of our method in handling geometric details, long-range deformation, and non-rigid deformation, showing superior editing flexibility and quality compared to previous approaches.
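To make the core mechanism concrete, below is a minimal, self-contained sketch (not the authors' implementation) of the idea described in the abstract: Gaussian positions are reparameterized as anchor + learnable offset, and both are optimized so that a rendering from the user-specified viewpoint matches the edited reference image, with a positional regularizer on the offsets. The toy 2D splatting renderer and all variable names here are illustrative assumptions, not the paper's actual code.

```python
# Hypothetical sketch: align Gaussian positions to an edited reference image
# rendered from one viewpoint, with positions reparameterized as anchor + offset
# so long-range motion can flow through the shared anchor. Toy 2D stand-in only.
import torch

H, W, N = 64, 64, 500                      # toy image size and number of Gaussians

torch.manual_seed(0)
base_xy = torch.rand(N, 2)                 # toy Gaussian centers in [0, 1]^2

# Reparameterization: position = anchor + offset, both learnable, so gradients
# propagate to a shared anchor that carries coarse, long-range deformation.
anchor = base_xy.mean(dim=0, keepdim=True).clone().requires_grad_(True)
offset = (base_xy - base_xy.mean(dim=0, keepdim=True)).clone().requires_grad_(True)

def splat(xy, sigma=0.02):
    """Toy differentiable splatting: accumulate isotropic 2D Gaussians into an image."""
    ys, xs = torch.meshgrid(torch.linspace(0, 1, H), torch.linspace(0, 1, W), indexing="ij")
    grid = torch.stack([xs, ys], dim=-1).reshape(-1, 2)          # (H*W, 2) pixel coords
    d2 = ((grid[:, None, :] - xy[None, :, :]) ** 2).sum(-1)      # squared distances (H*W, N)
    return torch.exp(-d2 / (2 * sigma ** 2)).sum(-1).reshape(H, W)

# Edited reference image from the specified viewpoint: here simply the same scene
# translated to the right, standing in for a user's 2D edit.
with torch.no_grad():
    target = splat(base_xy + torch.tensor([0.2, 0.0]))

opt = torch.optim.Adam([anchor, offset], lr=1e-2)
for step in range(200):
    render = splat(anchor + offset)                  # positions via reparameterization
    photo_loss = (render - target).abs().mean()      # photometric alignment to the edit
    pos_loss = offset.pow(2).mean()                  # positional regularizer on offsets
    loss = photo_loss + 0.01 * pos_loss
    opt.zero_grad()
    loss.backward()
    opt.step()
```

In this sketch the anchor absorbs the coarse translation while the offsets preserve local structure, loosely mirroring the coarse-to-fine, anchor-based optimization the abstract describes; the actual method operates on full 3D Gaussian Splatting with occlusion handling and adaptive masking.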
Primary Subject Area: [Generation] Generative Multimedia
Secondary Subject Area: [Content] Vision and Language
Relevance To Conference: Prior neural scene editing methods focus on directly manipulating geometry with the assistance of 3D software such as Blender. These methods follow a pipeline that extracts meshes from the learned radiance fields and uses the geometric structure to guide the deformation of the 3D scene. Due to imperfectly reconstructed geometry, they struggle with non-rigid deformation and fine-grained editing, limiting their application in 3D reconstruction and editing. To address these problems, we propose a single-image-driven approach to 3D scene editing, aligning with the philosophy of "what you see is what you get." Given a 3D Gaussian-based representation of a static scene and an edited image from a given viewpoint as the reference, our method drives the 3D Gaussians to align with the reference image, thereby achieving 3D editing. The supported editing operations include translation, rotation, non-rigid geometric deformation, and texture change. Through extensive experiments, we demonstrate that our method handles both object-level and scene-level editing while maintaining 3D consistency and structural stability. We further demonstrate that our method can capture dynamic 3D scenes from single-view video while maintaining temporal consistency.
Supplementary Material: zip
Submission Number: 641