NexusAD: Exploring the Nexus for Multimodal Perception and Comprehension of Corner Cases in Autonomous Driving
Keywords: Autonomous Driving, Corner Cases, Large Language Models, Multimodal
TL;DR: This report presents our approach for the Corner Case Scene Understanding track of the Autonomous Driving Challenge at the ECCV 2024 Workshop.
Subject: Large Language Model techniques adaptable for self-driving systems
Confirmation: I have read and agree with the submission policies of ECCV 2024 and the W-CODA Workshop on behalf of myself and my co-authors.
Abstract: This report presents our approach for the Corner Case Scene Understanding track of the Autonomous Driving Challenge at the ECCV 2024 Workshop. The advent of multimodal large language models (MLLMs) such as GPT-4V has showcased remarkable multimodal perception and understanding capabilities, even in dynamic street scenes. However, applying MLLMs to corner cases in autonomous driving remains a largely unexplored area.
Using the CODA-LM dataset, which pairs visual images with textual descriptions and analyses of corner cases, we adopt InternVL-2.0 as our base model and conduct domain-specific fine-tuning tailored to driving scenes. In this work, we enhance the use of spatial correlations within images by leveraging position and depth information to improve driving scene perception. Additionally, we incorporate chain-of-thought reasoning for greater accuracy and develop an in-context learning mechanism based on scene-aware retrieval, which further refines the model's understanding. This comprehensive strategy culminated in a final score of 68.97 on the leaderboard. Our code will be released at https://github.com/OpenVisualLab/NexusAD.
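For illustration only, the sketch below shows one way a scene-aware retrieval step for in-context learning could be realized, assuming precomputed embeddings of annotated CODA-LM scenes (e.g., from a CLIP-style image encoder). All function and variable names here are hypothetical and are not taken from the NexusAD code release.

```python
# Hypothetical sketch: retrieve similar annotated scenes and prepend them as
# in-context examples. Assumes scene embeddings were precomputed elsewhere.
import numpy as np


def cosine_sim(query: np.ndarray, corpus: np.ndarray) -> np.ndarray:
    """Cosine similarity between one query vector and a matrix of corpus vectors."""
    q = query / (np.linalg.norm(query) + 1e-8)
    c = corpus / (np.linalg.norm(corpus, axis=1, keepdims=True) + 1e-8)
    return c @ q


def retrieve_examples(query_emb, corpus_embs, corpus_texts, k=3):
    """Pick the k most similar annotated scenes to serve as in-context examples."""
    scores = cosine_sim(query_emb, corpus_embs)
    top = np.argsort(-scores)[:k]
    return [corpus_texts[i] for i in top]


def build_prompt(question, examples):
    """Prepend retrieved scene descriptions/analyses before the actual query."""
    context = "\n\n".join(f"Example:\n{ex}" for ex in examples)
    return f"{context}\n\nNow answer for the current scene:\n{question}"


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    corpus_embs = rng.normal(size=(5, 512))  # stand-in for real scene embeddings
    corpus_texts = [f"Scene {i}: description and corner-case analysis" for i in range(5)]
    query_emb = rng.normal(size=512)

    examples = retrieve_examples(query_emb, corpus_embs, corpus_texts, k=2)
    print(build_prompt("Describe the corner case and suggest a driving action.", examples))
```

The retrieved examples would then be passed, together with the query image, to the fine-tuned MLLM; the exact prompt format and encoder used in the submission may differ.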
Supplementary Material: pdf
Submission Number: 5