3D-Scene-Entities: Using Phrase-to-3D-Object Correspondences for Richer Visio-Linguistic Models in 3D Scenes
Abstract: Recently, there has been significant progress in connecting natural language to real-world 3D scenes. In particular, for the problems of reference disambiguation and discriminative reference production for objects in 3D scenes, various deep-learning-based approaches have been explored by tapping into novel datasets such as ScanRefer (Chen et al., 2019) and ReferIt3D (Achlioptas et al., 2020). In this paper, we curate a large-scale dataset that complements and extends both of the aforementioned ones by associating all objects mentioned in a referential sentence with their underlying instances in a 3D scene. Specifically, our 3D Scene Entities (3D-Scent) dataset provides an explicit correspondence between 369,039 objects, spanning 705 scenes, over 84,015 natural referential sentences. Crucially, we show that by incorporating simple and intuitive losses that enable learning from this new dataset, we can significantly improve the performance of several recently introduced neural-listening architectures, including improving the SoTA by 5.0% on both the ScanRefer and Nr3D benchmarks. Moreover, we experiment with competitive baseline methods for the task of language generation and show that, as with neural listeners, 3D neural speakers can also noticeably benefit from training with 3D-Scent. Last but not least, our carefully conducted experimental studies strongly support the conclusion that, by learning on 3D-Scent, commonly used visio-linguistic 3D architectures can become more semantically robust in their generalization, without requiring these newly collected annotations at test time.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Applications (eg, speech processing, computer vision, NLP)