Keywords: 3D Reconstruction, 3D Editing, VLM, Code Synthesis
TL;DR: We propose a new paradigm for 3D reconstruction and editing that generates executable code with VLMs, and evaluate it across a range of open-source and closed-source models.
Abstract: Most recent 3D reconstruction and editing systems operate on implicit or explicit representations such as NeRFs, point clouds, or meshes. While these representations enable high-fidelity rendering, they are inherently low-level and difficult to control automatically.
In contrast, we advocate a new \bfunderline{3D} reconstruction paradigm based on vision-language model (VLM) \bfunderline{Co}de \bfunderline{S}ynthesis (\bfunderline{3D-CoS}), in which 3D assets are constructed as executable Blender code, a programmatic and interpretable medium.
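To make the representation concrete, the sketch below shows the kind of executable Blender (bpy) script that can stand in for a 3D asset; the object, part names, and dimensions are purely illustrative and are not an example taken from the paper.

```python
# A minimal sketch of a 3D asset expressed as executable Blender (bpy) code.
# All names and dimensions are illustrative assumptions, not the paper's output.
import bpy

def make_table(width=1.2, depth=0.8, height=0.75,
               top_thickness=0.05, leg_radius=0.04):
    # Table top: a scaled cube primitive.
    bpy.ops.mesh.primitive_cube_add(size=1, location=(0, 0, height - top_thickness / 2))
    top = bpy.context.active_object
    top.name = "table_top"
    top.scale = (width, depth, top_thickness)

    # Four legs: cylinders placed at the corners of the top.
    leg_h = height - top_thickness
    for i, (sx, sy) in enumerate([(1, 1), (1, -1), (-1, 1), (-1, -1)]):
        x = sx * (width / 2 - leg_radius)
        y = sy * (depth / 2 - leg_radius)
        bpy.ops.mesh.primitive_cylinder_add(radius=leg_radius, depth=leg_h,
                                            location=(x, y, leg_h / 2))
        bpy.context.active_object.name = f"table_leg_{i}"

make_table()
```

Because every part is a named, parameterized program element rather than raw geometry, the script is both human-readable and machine-editable.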
To assess how well current VLMs can use code to represent 3D objects, we evaluate leading open-source and closed-source VLMs on code-based reconstruction under a unified protocol. We further introduce two model-agnostic improvements: a planning stage that produces a ratio-based, part-level blueprint before code synthesis, and Retrieval-Augmented Generation (RAG) over well-organized Blender API documentation.
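These two improvements can be pictured as follows; the blueprint schema and the retrieval helper are hypothetical sketches of the idea, not the paper's actual formats.

```python
# Hypothetical ratio-based, part-level blueprint produced by the planning stage
# (field names and values are illustrative assumptions, not the paper's schema).
blueprint = {
    "object": "table",
    "parts": [
        {"name": "top", "primitive": "cube", "count": 1,
         "size_ratio": {"w": 1.00, "d": 0.67, "h": 0.04}},   # relative to overall width
        {"name": "leg", "primitive": "cylinder", "count": 4,
         "size_ratio": {"r": 0.03, "h": 0.58}, "placement": "corners"},
    ],
}

# Minimal RAG sketch: rank pre-chunked Blender API doc snippets by keyword
# overlap with the plan; a real system would likely use embedding-based retrieval.
def retrieve_api_docs(query: str, doc_chunks: list[dict], k: int = 3) -> list[dict]:
    terms = set(query.lower().split())
    return sorted(doc_chunks,
                  key=lambda c: -len(terms & set(c["text"].lower().split())))[:k]
```

The blueprint fixes part structure and proportions before any code is written, so the synthesis step only has to translate an explicit plan into API calls grounded by the retrieved documentation.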
To demonstrate the unique advantages of this representation, we also present an evaluation focused on localized, text-driven modifications, comparing our code-based edits to state-of-the-art mesh-editing methods.
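As an illustration of the locality such edits enjoy, a text instruction like "make the table legs thicker" can compile to a change that touches only the named part; the naming convention below is our own assumption, carried over from the earlier sketch.

```python
import bpy

# Hypothetical localized edit for "make the table legs thicker":
# only objects named as legs are modified, so the rest of the asset
# (and its identity) is preserved by construction.
for obj in bpy.data.objects:
    if obj.name.startswith("table_leg_"):
        obj.scale = (1.8, 1.8, 1.0)  # widen the radius, keep the leg height
```

A mesh-editing method would instead have to segment and deform vertices, which is where identity preservation typically degrades.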
Our study shows that code as a 3D representation offers strong controllability and locality, exhibiting significant advantages in edit fidelity, identity preservation, and overall visual quality.
We further analyze the potential of this paradigm and delineate the current capability frontier of VLMs for programmatic 3D modeling, pointing to a promising future for reconstruction by code.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 17797