Iterative Finetuning VLM with Retrieval-augmented Synthetic Datasets Technical Reports for W-CODA Challenge Track-1 from Team OpenDriver

Published: 07 Sept 2024, Last Modified: 15 Sept 2024
Venue: ECCV 2024 W-CODA Workshop, Abstract Paper Track
License: CC BY 4.0
Keywords: Vision Language Models, Synthetic Datasets
Subject: Corner case mining and generation for autonomous driving
Abstract: Large Vision-Language Models (LVLMs) play a crucial role in autonomous driving, offering advanced visual reasoning capabilities that enhance system interpretability. However, these models often struggle with corner cases in open-world environments, leading to degraded performance. This paper addresses two key challenges: the limitations of pre-trained vision encoders in recognizing unfamiliar objects, and the insufficient reasoning abilities of existing models. We propose a solution that leverages retrieval-augmented synthetic datasets and iterative finetuning to enhance model performance. Our approach improves the model's visual knowledge and reasoning capabilities, yielding substantial gains on the CODA-LM benchmark: a 1.82x increase in general perception, a 97.34% improvement in region perception, and a 2.09x enhancement in driving suggestion accuracy. These results demonstrate the effectiveness of our method in improving LVLMs for open-world autonomous driving scenarios.
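The abstract names the method's two ingredients, retrieval-augmented synthetic data and iterative finetuning, without implementation detail. The minimal Python sketch below shows how such a loop could be wired together; all function names (retrieve_similar_cases, synthesize_examples, finetune, evaluate), the failure-driven retrieval step, and the stopping rule are illustrative assumptions rather than the authors' actual pipeline.

# Hypothetical sketch of iterative finetuning with retrieval-augmented
# synthetic data, as described at a high level in the abstract.
# Every function body below is a stand-in, not the authors' code.

def retrieve_similar_cases(failure_cases, knowledge_base, k=5):
    # Assumed retriever: for each corner case the model failed on,
    # pull the k most similar annotated references from an external corpus.
    return [item for _ in failure_cases for item in knowledge_base[:k]]

def synthesize_examples(retrieved, n=100):
    # Assumed generator: turn retrieved references into new
    # instruction-tuning pairs (e.g., templated Q&A about the scene).
    return [{"image": r, "question": "Describe the hazard.", "answer": "..."}
            for r in retrieved][:n]

def finetune(model, dataset):
    # Placeholder for one finetuning round on the LVLM.
    return model  # would return updated weights in a real pipeline

def evaluate(model, benchmark):
    # Placeholder for benchmark scoring (e.g., CODA-LM metrics).
    return 0.0

def iterative_finetune(model, benchmark, knowledge_base, rounds=3):
    # Repeat: find failures, retrieve references, synthesize data,
    # finetune, and stop once a round no longer improves the score.
    score = evaluate(model, benchmark)
    for _ in range(rounds):
        failures = list(benchmark)  # stand-in for cases the model got wrong
        retrieved = retrieve_similar_cases(failures, knowledge_base)
        synthetic = synthesize_examples(retrieved)
        model = finetune(model, synthetic)
        new_score = evaluate(model, benchmark)
        if new_score <= score:
            break
        score = new_score
    return model

if __name__ == "__main__":
    # Toy invocation showing the intended call pattern.
    iterative_finetune(object(), ["corner_case_1"], ["ref_a", "ref_b"])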
Submission Number: 9