InsertAny3D: VLM-Assisted and Geometry-Grounded Framework for 3D Object Insertion in Complex 3D Scenes

15 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: 3D Editing
Abstract: The insertion of 3D objects into complex scenes is a critical task in 3D asset editing. Previous works use 2D inpainting models to edit multi-view images and lift them into 3D, an approach that requires manual intervention and suffers from multi-view inconsistencies. To address these issues, we propose InsertAny3D, a novel framework for high-quality 3D object insertion in complex scenes guided by ambiguous natural-language instructions. Our framework consists of two key components: (1) VLM-Assisted 3D Scene Understanding, which decomposes abstract user intents and selects optimal insertion regions through a hierarchical vision-language reasoning strategy; and (2) Geometry-Grounded 3D Object Insertion, which performs anchor-constrained 3D object generation and placement using depth-based feature matching and multi-view geometric verification to ensure spatial coherence. Extensive experiments demonstrate that InsertAny3D significantly outperforms existing methods in insertion precision, visual quality, and interactive usability.
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 6080