ExCap3D: Expressive 3D Scene Understanding via Object Captioning with Varying Detail

Published: 20 Mar 2025 · Last Modified: 10 Feb 2026 · ICCV · CC BY 4.0
Abstract: Generating text descriptions of objects in 3D indoor scenes is an important building block of embodied understanding. Existing methods describe objects at a single level of detail and do not capture fine-grained details of the parts of objects. To produce varying levels of detail capturing both coarse object-level information and detailed part-level descriptions, we propose the task of expressive 3D captioning. Given an input 3D scene, the task is to describe objects at multiple levels of detail: a high-level object description, and a low-level description of the properties of its parts. To produce such captions, we present ExCap3D, an expressive 3D captioning model which takes as input a 3D scan and, for each detected object in the scan, generates a fine-grained collective description of the parts of the object, along with an object-level description conditioned on the part-level description. We design ExCap3D to encourage consistency between the multiple levels of description. To enable this task, we generated the ExCap3D Dataset by leveraging a visual-language model (VLM) for multi-view captioning. The ExCap3D Dataset contains captions on the ScanNet++ dataset with varying levels of detail, comprising 190k text descriptions of 34k 3D objects in 947 indoor scenes. Our experiments show that the object- and part-level details generated by ExCap3D are more expressive than those produced by state-of-the-art methods, with CIDEr score improvements of 17% and 124% for object- and part-level details, respectively. Our code, dataset, and models will be made publicly available.