From Regional to General: A Vision-Language Model-Based Framework for Corner Cases Comprehension in Autonomous Driving

Published: 07 Sept 2024, Last Modified: 15 Sept 2024 · ECCV 2024 W-CODA Workshop Abstract Paper Track · CC BY 4.0
Keywords: Vision-Language Models, Autonomous Driving, Corner Cases, Scene Understanding, Human-Like Comprehension
TL;DR: Report of the team FNN
Subject: Corner case mining and generation for autonomous driving
Abstract: Large Vision-Language Models (LVLMs) have demonstrated excellent capabilities in handling multi-modal tasks. However, in the field of autonomous driving, they still struggle with corner cases in traffic scenes, which often involve complex relationships among road users and objects. To strengthen LVLMs' understanding of traffic scenes, we propose a prompting-based progressive framework that boosts their comprehension of corner cases. Inspired by human thinking modes, our framework guides the LVLM to first analyze regional factors in the scene and then comprehend the general situation based on those regional understandings. A significance assessment mechanism is introduced in between to determine the scope of objects the LVLM should consider. The proposed method significantly outperforms the baselines on the CODA-LM dataset. Our code is available at https://github.com/hyhping2023/ECCV_FNN_Code.
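To make the regional-to-general flow described in the abstract concrete, here is a minimal sketch of how such a prompting pipeline might look. The helper `query_lvlm`, the prompt wording, and the yes/no significance check are all illustrative assumptions, not the authors' actual implementation; see the linked repository for the real code.

```python
# Hypothetical sketch of a regional-to-general prompting pipeline.
# All names and prompts below are assumptions for illustration only.

def query_lvlm(prompt: str, image_path: str) -> str:
    """Placeholder for a call to any LVLM backend (API or local model)."""
    raise NotImplementedError("Plug in your LVLM client here.")

def regional_to_general(image_path: str, regions: list[str]) -> str:
    # Stage 1: describe each regional factor (road user / object) in isolation.
    regional_notes = [
        query_lvlm(f"Describe the {r} in this traffic scene and its behavior.",
                   image_path)
        for r in regions
    ]

    # Stage 2: significance assessment -- keep only observations the model
    # judges relevant to the ego vehicle's driving decisions.
    significant = [
        note for note in regional_notes
        if "yes" in query_lvlm(
            "Is this observation significant for the ego vehicle's driving "
            f"decisions? Answer yes or no.\n\n{note}",
            image_path,
        ).lower()
    ]

    # Stage 3: general comprehension conditioned on the significant
    # regional understandings, mirroring the human regional-to-general mode.
    context = "\n".join(significant)
    return query_lvlm(
        "Given these regional observations:\n"
        f"{context}\n"
        "Summarize the overall traffic situation and advise the ego vehicle.",
        image_path,
    )
```

The staged structure, rather than any particular prompt text, is the point: regional analysis narrows attention before the final scene-level query, so the general summary is grounded in vetted local evidence.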
Submission Number: 7