Abstract: Generating text descriptions of objects in 3D indoor
scenes is an important building block of embodied understanding. Existing methods describe objects at a single level of detail and do not capture fine-grained part-level details. To produce varying levels of detail, capturing both coarse object-level information and detailed part-level descriptions, we propose the task of expressive
3D captioning. Given an input 3D scene, the task is to describe objects at multiple levels of detail: a high-level object
description, and a low-level description of the properties of
its parts. To produce such captions, we present ExCap3D,
an expressive 3D captioning model which takes as input a
3D scan, and for each detected object in the scan, generates
a fine-grained collective description of the parts of the object, along with an object-level description conditioned on
the part-level description. We design ExCap3D to encourage consistency between the multiple levels of descriptions.
To enable this task, we generated the ExCap3D Dataset by
leveraging a visual-language model (VLM) for multi-view
captioning. The ExCap3D Dataset contains captions on
the ScanNet++ dataset with varying levels of detail, comprising 190k text descriptions of 34k 3D objects in 947 indoor scenes. Our experiments show that the object- and
part-level details generated by ExCap3D are more expressive than those produced by state-of-the-art methods, with
a CIDEr score improvement of 17% and 124% for object- and part-level details, respectively. Our code, dataset, and
models will be made publicly available.