Recently, language foundation models have revolutionized many fields, yet how they could enable smarter and safer autonomous vehicles remains elusive. We believe one major obstacle is the lack of a comprehensive and standard middleware representation that links perception and planning. We rethink the limitations of existing middleware (e.g., 3D boxes or occupancy) and propose 3\textbf{D} d\textbf{e}n\textbf{s}e capt\textbf{i}onin\textbf{g} beyond \textbf{n}ouns (abbreviated as DESIGN). For each input scenario, DESIGN refers to a set of 3D bounding boxes, each paired with a language description. Notably, the \textbf{comprehensive} description covers not only what the box is (noun) but also its attribute (adjective), location (preposition), and moving status (adverb). We design a scalable rule-based auto-labelling methodology to generate DESIGN ground truth, guaranteeing that the middleware is \textbf{standard}. Using this methodology, we construct nuDesign, a large-scale dataset built upon nuScenes that consists of an unprecedented 2300k sentences. We also present an extensive benchmark on nuDesign, featuring a model named DESIGN-former that takes multi-modal inputs and predicts reliable DESIGN outputs. Through qualitative visualizations, we demonstrate that DESIGN, as a novel 3D scene understanding middleware, offers good interpretability. We release our code, data, and models, hoping this middleware can trigger better autonomous driving algorithms and systems that benefit from the power of language foundation models.
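To make the representation concrete, below is a minimal sketch of what a single DESIGN record could look like as a data structure: a 3D box plus the four grammatical parts of its caption, with a template that composes them into one sentence. The class name, field layout, example values, and sentence template are all our illustrative assumptions, not the paper's actual schema.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class DesignRecord:
    """One hypothetical DESIGN entry: a 3D bounding box plus four caption parts."""
    center: Tuple[float, float, float]  # box center (x, y, z) in metres
    size: Tuple[float, float, float]    # box extent (w, l, h) in metres
    yaw: float                          # heading angle in radians
    noun: str                           # what the object is, e.g. "car"
    adjective: str                      # its attribute, e.g. "white"
    preposition: str                    # its location relative to the ego vehicle
    adverb: str                         # its moving status, e.g. "moving forward slowly"

    def caption(self) -> str:
        """Compose the four parts into one dense-caption sentence (template is a guess)."""
        return f"A {self.adjective} {self.noun} {self.preposition}, {self.adverb}."


# Hypothetical usage with made-up values:
rec = DesignRecord(
    center=(12.3, -4.1, 0.9),
    size=(1.9, 4.6, 1.6),
    yaw=0.12,
    noun="car",
    adjective="white",
    preposition="in the left front of the ego vehicle",
    adverb="moving forward slowly",
)
print(rec.caption())
# -> "A white car in the left front of the ego vehicle, moving forward slowly."
```

Under this reading, a rule-based auto-labeller would fill the four string fields deterministically from existing annotation fields (category, attribute, relative pose, and velocity), which is what would make the resulting middleware standard and scalable.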