Abstract: We introduce EMMA, an End-to-end Multimodal Model for Autonomous driving. Built on a multimodal large language model foundation, EMMA directly maps raw camera sensor data into various driving-specific outputs, including planner trajectories, perception objects, and road graph elements. EMMA maximizes the utility of world knowledge from the pre-trained large language models by representing all non-sensor inputs (e.g. navigation instructions and ego vehicle status) and outputs (e.g. trajectories and 3D locations) as natural language text. This approach allows EMMA to jointly process various driving tasks in a unified language space and generate the outputs for each task using task-specific prompts. Empirically, we demonstrate EMMA’s effectiveness by achieving state-of-the-art performance in motion planning on nuScenes as well as competitive results on an in-house large-scale benchmark. EMMA also yields competitive results for camera-primary 3D object detection on the Waymo Open Dataset (WOD). We show that co-training EMMA with planner trajectories, object detection, and road graph tasks yields improvements across all three domains, highlighting EMMA’s potential as a generalist model for autonomous driving applications. We hope that our results will inspire research to further evolve the state of the art in autonomous driving model architectures.
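For concreteness, below is a minimal sketch of how a planning query could be framed entirely in text, in the spirit the abstract describes. This is not the authors' code: the prompt template, field names, and waypoint format are illustrative assumptions for this note, not EMMA's actual interface.

```python
import re

# Hypothetical prompt builder: renders all non-sensor inputs
# (navigation instruction, ego status) as natural language text.
def build_planning_prompt(navigation_instruction: str,
                          ego_speed_mps: float,
                          ego_accel_mps2: float) -> str:
    return (
        "Task: predict the ego vehicle's future trajectory.\n"
        f"Navigation instruction: {navigation_instruction}\n"
        f"Ego status: speed {ego_speed_mps:.1f} m/s, "
        f"acceleration {ego_accel_mps2:.1f} m/s^2.\n"
        "Answer with waypoints as (x, y) pairs in meters."
    )

# Hypothetical decoder: parses a text answer such as
# "(1.2, 0.0) (2.5, 0.3) ..." back into numeric waypoints.
def parse_waypoints(model_output: str) -> list[tuple[float, float]]:
    return [(float(x), float(y))
            for x, y in re.findall(r"\(([-\d.]+),\s*([-\d.]+)\)", model_output)]

if __name__ == "__main__":
    prompt = build_planning_prompt("turn right at the next intersection", 5.0, 0.2)
    print(prompt)
    # A made-up model response, for illustration only:
    print(parse_waypoints("(1.2, 0.0) (2.5, 0.3) (3.9, 1.1)"))
```

The point of the sketch is only that both the conditioning signals and the trajectory output live in the same text space, so task-specific prompts can switch the model between planning, detection, and road graph outputs without changing its interface.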
Submission Length: Long submission (more than 12 pages of main content)
Changes Since Last Submission: We thank the reviewers for their insightful feedback and constructive suggestions. We have carefully considered all comments and have revised the manuscript accordingly. In the spirit of open science and reproducibility, we have also disclosed the scale of our internal datasets.
The major changes in this camera-ready revision are summarized below (sorted by relative locations in the PDF):
1. Discuss self-supervised references.
- Added a sentence at the end of Section 2.1.
2. Clarify self-supervised nature.
- Added a sentence at the end of Section 2.1.
3. Example prompts for driving rationales.
- Added example prompts in Section 2.2, [R1 Scene Description] and [R3 Behavior description of critical objects].
4. De-anonymize WOMD in various places.
5. Add a new Section 3.1 "Summary of Datasets".
- This section brings together all the major datasets used in the experimental section. We also reveal the actual scale of the internal datasets.
6. Visualize failure examples.
- Added Section "A.3 Failure Examples" and Figure 12 in the Appendix.
7. Provide concrete prompts and predictions.
- Added Section "A.4 Concrete Prompts and Answers" and Table 6.
8. Discuss NAVSIM.
- Added discussions on open-loop evaluation in Section "A.5 Limitations, Risks, and Mitigations".
9. Discuss inference cost.
- Added a paragraph at the end of Section "A.5 Limitations, Risks, and Mitigations".
Assigned Action Editor: ~Chunyuan_Li1
Submission Number: 4557