InsertAny3D: VLM-Assisted and Geometry-Grounded Framework for 3D Object Insertion in Complex 3D Scenes

15 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: 3D Editing
Abstract: The insertion of 3D objects into complex scenes is a critical task in 3D asset editing. Previous works use 2D inpainting models to edit multi-view images and lift them into 3D, an approach that requires manual intervention and suffers from multi-view inconsistencies. To address these issues, we propose InsertAny3D, a novel framework for high-quality 3D object insertion in complex scenes guided by ambiguous natural-language instructions. Our framework consists of two key components: (1) VLM-Assisted 3D Scene Understanding, which decomposes abstract user intents and selects optimal insertion regions through a hierarchical vision-language reasoning strategy; and (2) Geometry-Grounded 3D Object Insertion, which performs anchor-constrained 3D object generation and placement using depth-based feature matching and multi-view geometric verification to ensure spatial coherence. Extensive experiments demonstrate that InsertAny3D significantly outperforms existing methods in insertion precision, visual quality, and interactive usability.
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 6080