Submission Type: Non-archive
Keywords: Vision-language models, Maneuver prediction, Large language models, MobileViT
TL;DR: We propose 'ManeuverVLM', which uses visual and temporal encoders to extract embeddings from scene images and dynamic signals for driving maneuver prediction.
Abstract: Maneuver prediction in modern vehicles enhances safety by anticipating driver actions, enabling advanced driver assistance systems (ADAS) to provide proactive support and accident prevention. This research presents `ManeuverVLM', a vision-language model (VLM) that integrates scene images and dynamic signals for maneuver prediction. The model employs a vision encoder to extract spatial-visual embeddings, a temporal encoder for dynamic signals, and a large language model (LLM) for maneuver classification. We evaluated the proposed model on our collected driving dataset, which covers five maneuvers: straight, left/right turn, and left/right lane change. Experimental results demonstrate that ManeuverVLM with T5-mini achieves superior performance, reaching 99\% micro-accuracy, 98\% macro-accuracy, and a 97\% macro F1-score on this dataset. Notably, ManeuverVLM effectively handles challenging minority maneuvers, such as turning and lane changing, outperforming both temporal-only and spatial-temporal models that do not integrate an LLM. The proposed model, with 36.5 million parameters, 61.2 giga floating-point operations (GFLOPs), and a memory footprint of only 163 MB, is deployable on compact embedded processors in the vehicle.
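The abstract outlines a two-stream architecture (vision encoder for scene images, temporal encoder for dynamic signals, LLM-based classifier). The sketch below is a minimal, illustrative PyTorch reconstruction of that structure, not the authors' code: the simple convolutional stack stands in for MobileViT, the GRU stands in for the paper's temporal encoder, a small transformer stands in for T5-mini, and all module names, dimensions, and signal counts are assumptions.

```python
# Minimal architectural sketch of a ManeuverVLM-style model (illustrative only).
import torch
import torch.nn as nn


class ManeuverVLMSketch(nn.Module):
    def __init__(self, num_maneuvers: int = 5, embed_dim: int = 256):
        super().__init__()
        # Spatial-visual embeddings from scene images (placeholder for MobileViT).
        self.vision_encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(32, embed_dim),
        )
        # Temporal embeddings from dynamic signals (e.g. speed, steering angle);
        # 8 input channels is an assumed signal count.
        self.temporal_encoder = nn.GRU(
            input_size=8, hidden_size=embed_dim, batch_first=True
        )
        # Lightweight transformer standing in for the T5-mini language model.
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=4, batch_first=True
        )
        self.fusion = nn.TransformerEncoder(encoder_layer, num_layers=2)
        # Maneuver head: straight, left/right turn, left/right lane change.
        self.classifier = nn.Linear(embed_dim, num_maneuvers)

    def forward(self, images: torch.Tensor, signals: torch.Tensor) -> torch.Tensor:
        # images: (batch, 3, H, W); signals: (batch, time, 8)
        vis = self.vision_encoder(images).unsqueeze(1)        # (batch, 1, embed_dim)
        _, temp = self.temporal_encoder(signals)              # (1, batch, embed_dim)
        temp = temp.transpose(0, 1)                           # (batch, 1, embed_dim)
        fused = self.fusion(torch.cat([vis, temp], dim=1))    # (batch, 2, embed_dim)
        return self.classifier(fused.mean(dim=1))             # (batch, num_maneuvers)


if __name__ == "__main__":
    model = ManeuverVLMSketch()
    logits = model(torch.randn(2, 3, 224, 224), torch.randn(2, 20, 8))
    print(logits.shape)  # torch.Size([2, 5])
```

In the paper, the classifier would be the T5-mini LLM operating on the fused visual and temporal embeddings; the stub above only mirrors the data flow so the five-class maneuver prediction setup is concrete.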
Submission Number: 9