Chain-of-Thought Guided Multimodal Large Language Models for Scene-Aware Accident Anticipation in Autonomous Driving
Abstract: Accurately anticipating traffic accidents is a fundamental task for the safe and effective deployment of autonomous vehicles (AVs). However, existing models primarily rely on dashcam footage and often fail to generalize across varied driving scenarios due to their dependence on visual data and the rarity of high-risk events in datasets. These limitations undermine their robustness and reduce their practical applicability in dynamic, unpredictable environments. To address these challenges, this study proposes a novel approach, termed MLTA, which integrates multimodal learning with a hypergraph attention network to hierarchically extract and capture cross-modal interactions. It leverages LLaVA-NeXT, a multimodal large language model (MLLM) guided by the Chain-of-Thought (CoT) prompting paradigm, to produce context-aware interpretations of traffic scenes. This is further enhanced by a human-inspired attention mechanism that mimics the decision-making priorities of experienced human drivers. This combination enables more accurate identification of critical elements in a scene, improving both prediction precision and timeliness. Extensive experiments on four real-world datasets—DAD, A3D, CCD, and DADA-2000—show that our approach consistently outperforms state-of-the-art (SOTA) methods, demonstrating strong adaptability and robustness in complex driving environments.
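
The sketch below is a minimal, hedged illustration (not the authors' released code) of the CoT-guided MLLM component described in the abstract: prompting LLaVA-NeXT to reason step by step about a dashcam frame before judging accident risk. The checkpoint name, prompt wording, and file path are assumptions for illustration only.

```python
# Minimal sketch: CoT-style scene interpretation with LLaVA-NeXT via
# Hugging Face Transformers. Checkpoint, prompt, and input path are assumed.
import torch
from PIL import Image
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

MODEL_ID = "llava-hf/llava-v1.6-mistral-7b-hf"  # assumed checkpoint

processor = LlavaNextProcessor.from_pretrained(MODEL_ID)
model = LlavaNextForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

# Chain-of-Thought prompt: ask the model to reason step by step about agents,
# their motion, and risky interactions before stating an accident judgement.
cot_prompt = (
    "[INST] <image>\n"
    "You are an experienced driver watching a dashcam frame. Think step by step: "
    "(1) list the traffic agents you see, "
    "(2) describe their positions and likely motion, "
    "(3) identify any risky interactions, "
    "(4) state whether an accident is likely in the next few seconds. [/INST]"
)

frame = Image.open("dashcam_frame.jpg")  # hypothetical input frame
inputs = processor(images=frame, text=cot_prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=256)

# Context-aware scene interpretation that a downstream anticipation module could consume.
scene_description = processor.decode(output_ids[0], skip_special_tokens=True)
print(scene_description)
```

In the paper's pipeline, such text-based interpretations would be one modality among several (e.g., visual features), fused by the hypergraph attention network; the sketch covers only the prompting step.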
External IDs: doi:10.1109/TITS.2025.3597411