SafeAuto: Knowledge-Enhanced Safe Autonomous Driving with Multimodal Foundation Models

ICLR 2025 Conference Submission13573 Authors

28 Sept 2024 (modified: 22 Nov 2024)ICLR 2025 Conference SubmissionEveryoneRevisionsBibTeXCC BY 4.0
Keywords: Autonomous Driving; Multimodal Large Language Models; Multimodal Retrieval-Augmented Generation; Probabilistic Graph Model
TL;DR: We propose SafeAuto that includes a specialized PDCE loss for low-level control to improve precision and safety, and enhances high-level action prediction by integrating past driving experiences and precise traffic rules into multimodal models.
Abstract: Traditional autonomous driving systems often struggle to harmonize high-level reasoning with low-level control, leading to suboptimal and even unsafe driving behaviors. The emergence of multimodal large language models (MLLMs), capable of processing visual and textual data, presents an opportunity to unify perception and reasoning tasks within a single framework. However, integrating precise safety knowledge into MLLMs for safe autonomous driving remains a significant challenge. To address this, we propose SafeAuto, a novel framework that enhances MLLM-based autonomous driving systems by incorporating both unstructured and structured knowledge. In particular, we first propose the Place-Dependent Cross-Entropy (PDCE) loss function, which is specifically designed to enhance the accuracy of low-level control signal predictions when treating numerical values as text. To explicitly integrate precise safety knowledge into the MLLM to enable safe autonomous driving, we build a reasoning component for SafeAuto, which first parses driving safety regulations into first-order logic rules (e.g., "red light $\implies$ stop") and then integrates these rules into a probabilistic graphical model, such as a Markov Logic Network (MLN). The environment attributes, identified by attribute recognition models (e.g., detecting a red light), are used to form the predicates in MLN. In addition, the environmental attributes utilized for reasoning are also considered factors in retrieval to construct a Multimodal Retrieval-Augmented Generation (RAG) model, which aims to learn from past similar driving experiences more effectively. Extensive experiments demonstrate that SafeAuto significantly outperforms baselines across multiple datasets. By bridging the gap between high-level reasoning and low-level control, SafeAuto paves the way for more accurate, reliable, and safer autonomous driving, facilitating systems that learn effectively from experience, adhere to traffic regulations, and execute precise control actions.
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 13573
Loading