NexusAD: Exploring the Nexus for Multimodal Perception and Comprehension of Corner Cases in Autonomous Driving
This report presents our approach for the Corner Case Scene Understanding track of the Autonomous Driving Challenge at the ECCV 2024 Workshop. The advent of multimodal large language models (MLLMs) such as GPT-4V has showcased remarkable multimodal perception and understanding capabilities, even in dynamic street scenes. However, applying MLLMs to corner cases in autonomous driving remains largely unexplored. Using the CODA-LM dataset, which pairs visual images with textual descriptions and analyses of corner cases, we adopt InternVL-2.0 as our base model and perform domain-specific fine-tuning tailored to driving scenes. In this work, we enhance the model's use of spatial correlations within images by leveraging position and depth information to improve driving-scene perception. Additionally, we incorporate chain-of-thought reasoning for greater accuracy and develop an in-context learning mechanism based on scene-aware retrieval, which further refines the model's understanding. This comprehensive strategy achieved a final score of \textbf{68.97} on the leaderboard. Our code will be released at https://github.com/OpenVisualLab/NexusAD.
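To illustrate the scene-aware retrieval idea mentioned above, the sketch below shows one plausible way to select in-context examples: embed each training scene, rank scenes by cosine similarity to the test image's embedding, and prepend the annotations of the top-k matches to the prompt. The function names, embedding dimensionality, and random toy embeddings are assumptions for illustration only, not the implementation used in this work.

```python
import numpy as np

def cosine_similarity(query: np.ndarray, bank: np.ndarray) -> np.ndarray:
    """Cosine similarity between a query vector and a bank of vectors."""
    query = query / (np.linalg.norm(query) + 1e-8)
    bank = bank / (np.linalg.norm(bank, axis=1, keepdims=True) + 1e-8)
    return bank @ query

def retrieve_in_context_examples(query_embedding: np.ndarray,
                                 bank_embeddings: np.ndarray,
                                 bank_annotations: list,
                                 k: int = 3) -> list:
    """Return annotations of the k training scenes most similar to the query,
    to be used as in-context examples in the prompt."""
    scores = cosine_similarity(query_embedding, bank_embeddings)
    top_k = np.argsort(scores)[::-1][:k]
    return [bank_annotations[i] for i in top_k]

# Toy usage: random vectors stand in for real scene embeddings (hypothetical).
rng = np.random.default_rng(0)
bank = rng.normal(size=(100, 512))                 # training-scene embeddings
annotations = [f"scene {i} description" for i in range(100)]
query = rng.normal(size=512)                       # test-scene embedding
print(retrieve_in_context_examples(query, bank, annotations, k=3))
```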