Recently, language foundation models have revolutionized many fields, yet how they could enable smarter and safer autonomous vehicles remains elusive. We believe one major obstacle is the lack of a comprehensive and standard middleware representation that links perception and planning. We rethink the limitations of existing middleware (e.g., 3D boxes or occupancy) and propose 3\textbf{D} d\textbf{e}n\textbf{s}e capt\textbf{i}onin\textbf{g} beyond \textbf{n}ouns (abbreviated as DESIGN). For each input scenario, DESIGN refers to a set of 3D bounding boxes, each paired with a language description. Notably, the \textbf{comprehensive} description covers not only what the box is (noun) but also its attribute (adjective), location (preposition), and moving status (adverb). We design a scalable rule-based auto-labelling methodology to generate DESIGN ground truth, guaranteeing that the middleware is \textbf{standard}. Using this methodology, we construct nuDesign, a large-scale dataset built upon nuScenes that consists of an unprecedented 2300k sentences. We also present an extensive benchmark on nuDesign, featuring a model named DESIGN-former that takes multi-modal inputs and predicts reliable DESIGN outputs. Through qualitative visualizations, we demonstrate that DESIGN, as a novel 3D scene understanding middleware, offers good interpretability. We release our code, data, and models, hoping this middleware can trigger better autonomous driving algorithms and systems that benefit from the power of language foundation models.
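To make the representation concrete, below is a minimal sketch of what a single DESIGN record could look like as a data structure: a 3D box plus the four grammatical parts of its caption, with a template that composes them into one sentence. The class name, field layout, example values, and sentence template are all our illustrative assumptions, not the paper's actual schema.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class DesignRecord:
    """One hypothetical DESIGN entry: a 3D bounding box plus four caption parts."""
    center: Tuple[float, float, float]  # box center (x, y, z) in metres
    size: Tuple[float, float, float]    # box extent (w, l, h) in metres
    yaw: float                          # heading angle in radians
    noun: str                           # what the object is, e.g. "car"
    adjective: str                      # its attribute, e.g. "white"
    preposition: str                    # its location relative to the ego vehicle
    adverb: str                         # its moving status, e.g. "moving forward slowly"

    def caption(self) -> str:
        """Compose the four parts into one dense-caption sentence (template is a guess)."""
        return f"A {self.adjective} {self.noun} {self.preposition}, {self.adverb}."


# Hypothetical usage with made-up values:
rec = DesignRecord(
    center=(12.3, -4.1, 0.9),
    size=(1.9, 4.6, 1.6),
    yaw=0.12,
    noun="car",
    adjective="white",
    preposition="in the left front of the ego vehicle",
    adverb="moving forward slowly",
)
print(rec.caption())
# -> "A white car in the left front of the ego vehicle, moving forward slowly."
```

Under this reading, a rule-based auto-labeller would fill the four string fields deterministically from existing annotation fields (category, attribute, relative pose, and velocity), which is what would make the resulting middleware standard and scalable.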