TL;DR: We propose SafeAuto, a framework that adds a specialized PDCE loss to make low-level control predictions more precise and safer, and that improves high-level action prediction by integrating past driving experiences and precise traffic rules into multimodal models.
Abstract: Traditional autonomous driving systems often struggle to connect high-level reasoning with low-level control, leading to suboptimal and sometimes unsafe behaviors. Recent advances in multimodal large language models (MLLMs), which process both visual and textual data, offer an opportunity to unify perception and reasoning. However, effectively embedding precise safety knowledge into MLLMs for autonomous driving remains a significant challenge.
To address this, we propose SafeAuto, a framework that enhances MLLM-based autonomous driving by incorporating both unstructured and structured knowledge. First, we introduce a Position-Dependent Cross-Entropy (PDCE) loss to improve low-level control signal predictions when values are represented as text. Second, to explicitly integrate safety knowledge, we develop a reasoning component that translates traffic rules into first-order logic (e.g., "red light => stop") and embeds them into a probabilistic graphical model (e.g., a Markov Logic Network, MLN) to verify predicted actions using recognized environmental attributes.
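To make the rule-verification idea concrete, the following is a minimal sketch of how weighted first-order rules such as "red light => stop" could score a predicted action in a Markov Logic Network. It is an illustration under stated assumptions, not SafeAuto's implementation: the rule set, the weights, the attribute and action names, and the `verify` threshold are all hypothetical, and inference is done by brute-force enumeration, which only works for a toy number of variables.

```python
import itertools
import math

def implies(a, b):
    """Material implication a => b."""
    return (not a) or b

# Hypothetical weighted rules (formula over a world, weight).
# Higher weight = the rule is harder to violate.
RULES = [
    (lambda w: implies(w["red_light"], w["stop"]), 4.0),
    (lambda w: implies(w["stop"], not w["accelerate"]), 6.0),
    (lambda w: implies(not w["red_light"], w["accelerate"]), 1.0),
]

# Action variables to infer; environmental attributes arrive as evidence.
ACTIONS = ("stop", "accelerate")

def world_score(world):
    """Unnormalized MLN score: exp(sum of weights of satisfied rules)."""
    return math.exp(sum(weight for formula, weight in RULES if formula(world)))

def action_marginals(evidence):
    """Marginal probability of each action given observed attributes."""
    totals = {a: 0.0 for a in ACTIONS}
    z = 0.0
    for values in itertools.product([False, True], repeat=len(ACTIONS)):
        world = dict(evidence, **dict(zip(ACTIONS, values)))
        score = world_score(world)
        z += score
        for a in ACTIONS:
            if world[a]:
                totals[a] += score
    return {a: totals[a] / z for a in ACTIONS}

def verify(predicted_action, evidence, threshold=0.5):
    """Accept the predicted action only if the rules make it likely enough."""
    return action_marginals(evidence)[predicted_action] >= threshold

red = action_marginals({"red_light": True})
green = action_marginals({"red_light": False})
print(red, green)
print(verify("accelerate", {"red_light": True}))
```

With a red light recognized, the "stop" marginal dominates and a predicted "accelerate" is rejected. Because the rules carry finite weights rather than hard constraints, noisy attribute recognition lowers an action's probability instead of ruling it out.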
Additionally, our Multimodal Retrieval-Augmented Generation (RAG) model leverages video, control signals, and environmental attributes to learn from past driving experiences. Integrating PDCE, MLN, and Multimodal RAG, SafeAuto outperforms existing baselines across multiple datasets, enabling more accurate, reliable, and safer autonomous driving. The code is available at https://github.com/AI-secure/SafeAuto.
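The retrieval step can be pictured as a nearest-neighbour lookup over stored driving experiences. The sketch below is illustrative only and makes several assumptions not stated in the abstract: each past experience is assumed to be already encoded as a single vector (standing in for a joint embedding of video, control signals, and environmental attributes), similarity is plain cosine similarity, and the `MEMORY` entries and field names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve(query, memory, k=2):
    """Return the k stored experiences most similar to the query embedding."""
    ranked = sorted(
        memory,
        key=lambda item: cosine(query, item["embedding"]),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical memory of past driving experiences (toy 3-d embeddings).
MEMORY = [
    {"embedding": [0.9, 0.1, 0.0], "action": "stop", "note": "red light ahead"},
    {"embedding": [0.1, 0.9, 0.1], "action": "accelerate", "note": "clear road"},
    {"embedding": [0.8, 0.2, 0.1], "action": "stop", "note": "pedestrian crossing"},
]

query = [1.0, 0.0, 0.0]  # embedding of the current scene (assumed given)
top = retrieve(query, MEMORY, k=2)
for item in top:
    print(item["action"], item["note"])
```

The retrieved experiences would then be supplied to the model as additional context when it predicts the high-level action for the current scene.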
Lay Summary: Autonomous driving systems traditionally rely on separate modules for decision-making (e.g., deciding when to stop at a red light) and controlling the vehicle (e.g., adjusting speed or steering). This separation often results in inefficient or unsafe behaviors because high-level decisions and precise control actions are deeply interconnected. Recently, large AI models capable of understanding both visual scenes and text have offered a promising solution by combining these tasks into one unified system. However, teaching these models precise safety rules, like traffic regulations, remains challenging.
We introduce a new approach that explicitly integrates safety rules into these models to enhance autonomous driving reliability. Our method includes a specialized loss function (PDCE loss) that improves the accuracy of numerical predictions (like vehicle speed) without sacrificing the AI model’s language-based reasoning capabilities. Additionally, we embed structured safety rules into a logical reasoning framework, enabling the AI to verify its driving decisions explicitly. Lastly, we developed a retrieval system that allows the AI to learn from previous driving scenarios to better handle new situations.
Together, these innovations significantly improve autonomous driving safety, accuracy, and reliability compared to existing systems.
Link To Code: https://github.com/AI-secure/SafeAuto
Primary Area: Deep Learning->Large Language Models
Keywords: Autonomous Driving; Multimodal Large Language Models; Multimodal Retrieval-Augmented Generation; Probabilistic Graph Model; Markov Logic Network
Submission Number: 15475