A Lightweight Vision-Language Model Pipeline for Corner-Case Scene Understanding in Autonomous Driving

Published: 07 Sept 2024 · Last Modified: 15 Sept 2024 · ECCV 2024 W-CODA Workshop Abstract Paper Track · CC BY 4.0
Keywords: Corner-Case Scene Understanding, Lightweight Vision-Language Model (VLM), Mixture of Experts (MoE)
Subject: Corner case mining and generation for autonomous driving
Confirmation: I have read and agree with the submission policies of ECCV 2024 and the W-CODA Workshop on behalf of myself and my co-authors.
Abstract: This paper describes our method for the ECCV 2024 Workshop W-CODA Track 1: Corner Case Scene Understanding. We propose LiteViLA, a Lightweight Vision-Language model pipeline for corner-case scene understanding in Autonomous driving, which leverages the TinyLLaVA backbone to process large-scale multimodal data efficiently. Our approach extracts visual features through a Vision Encoder and a Q-Former; the Language Model (LM) then fuses the visual and language modalities through a Mixture-of-Adapters (MoA) mechanism. The MoA dynamically selects task-specific adapters for General Perception, Region Perception, and Driving Suggestions, optimizing performance across these three tasks. Finally, a Reviewer component refines the generated answers to improve their accuracy and relevance.
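To make the Mixture-of-Adapters idea concrete, the sketch below shows one plausible PyTorch-style routing layer: a gate scores the three task-specific adapters (General Perception, Region Perception, Driving Suggestions) from the LM hidden states and mixes their outputs. This is a minimal illustration under assumed module names, dimensions, and a softmax gating scheme; it is not the authors' released implementation.

```python
# Minimal sketch of a Mixture-of-Adapters (MoA) routing layer.
# Assumptions (not from the paper): bottleneck adapters, mean-pooled gating,
# soft mixture over three task-specific adapters.
import torch
import torch.nn as nn


class Adapter(nn.Module):
    """Residual bottleneck adapter inserted into the LM hidden stream."""

    def __init__(self, hidden_dim: int, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x + up(act(down(x))): residual bottleneck transformation
        return x + self.up(self.act(self.down(x)))


class MixtureOfAdapters(nn.Module):
    """Routes hidden states to task-specific adapters via a learned gate."""

    TASKS = ("general_perception", "region_perception", "driving_suggestion")

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.adapters = nn.ModuleDict(
            {task: Adapter(hidden_dim) for task in self.TASKS}
        )
        # Gate scores each task from the pooled hidden state.
        self.gate = nn.Linear(hidden_dim, len(self.TASKS))

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq_len, hidden_dim)
        pooled = hidden.mean(dim=1)                       # (batch, hidden_dim)
        weights = torch.softmax(self.gate(pooled), dim=-1)  # (batch, num_tasks)
        outputs = torch.stack(
            [self.adapters[t](hidden) for t in self.TASKS], dim=1
        )                                                 # (batch, tasks, seq, dim)
        # Weighted sum of adapter outputs per example.
        return (weights[:, :, None, None] * outputs).sum(dim=1)


if __name__ == "__main__":
    moa = MixtureOfAdapters(hidden_dim=768)
    dummy = torch.randn(2, 16, 768)   # (batch, seq_len, hidden_dim)
    print(moa(dummy).shape)           # torch.Size([2, 16, 768])
```

A hard top-1 selection (routing each query to the single highest-scoring adapter) would also fit the paper's description of "dynamically selecting" task-specific adapters; the soft mixture shown here is simply one common way to keep the gate differentiable.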
Submission Number: 8