Iterative Finetuning of VLMs with Retrieval-augmented Synthetic Datasets: Technical Report for W-CODA Challenge Track-1 from Team OpenDriver
Keywords: Vision Language Models, Synthetic Datasets
Subject: Corner case mining and generation for autonomous driving
Confirmation: I have read and agree with the submission policies of ECCV 2024 and the W-CODA Workshop on behalf of myself and my co-authors.
Abstract: Large Vision-Language Models (LVLMs) play a crucial role in autonomous driving, offering advanced visual reasoning capabilities that enhance system interpretability. However, these models often struggle with corner cases in open-world environments, leading to degraded performance. This paper addresses two key challenges: the limited ability of pre-trained vision encoders to recognize unfamiliar objects, and the insufficient reasoning abilities of existing models. We propose a solution that leverages retrieval-augmented synthetic datasets and iterative finetuning to enhance model performance. Our approach improves the model's visual knowledge and reasoning capabilities, yielding substantial gains on the CODA-LM benchmark: a 1.82x increase in general perception, a 97.34% improvement in region perception, and a 2.09x improvement in driving suggestion. These results demonstrate the effectiveness of our method in improving LVLMs for open-world autonomous driving scenarios.
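To make the shape of the approach concrete, below is a minimal Python sketch of an iterative finetuning loop over retrieval-augmented synthetic data, as described in the abstract. This is our own illustrative reconstruction, not the authors' released code: every helper passed in (evaluate, retrieve_similar, make_synthetic_qa, finetune_step) and the specific thresholds are hypothetical placeholders.

```python
from typing import Callable, List

def iterative_finetune(
    model: object,
    corner_cases: List[dict],
    retrieve_similar: Callable[[dict, int], List[dict]],
    make_synthetic_qa: Callable[[dict, List[dict]], dict],
    finetune_step: Callable[[object, List[dict]], object],
    evaluate: Callable[[object, dict], float],
    rounds: int = 3,
    threshold: float = 0.5,
) -> object:
    """Repeat: mine hard corner cases, build retrieval-augmented synthetic
    QA data for them, finetune, and re-evaluate with the updated model.

    All callables are placeholders for pipeline components the paper
    describes only at a high level.
    """
    for _ in range(rounds):
        # 1. Mine cases the current model scores poorly on.
        hard = [c for c in corner_cases if evaluate(model, c) < threshold]
        if not hard:
            break
        # 2. Ground each hard case with retrieved similar examples (the
        #    "retrieval-augmented" step), then synthesize perception /
        #    driving-suggestion QA pairs from them.
        synthetic = [make_synthetic_qa(c, retrieve_similar(c, 5)) for c in hard]
        # 3. Finetune on the synthetic data and loop with the new model.
        model = finetune_step(model, synthetic)
    return model
```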
Submission Number: 9