Submission Type: Non-archive
Keywords: Vision-language models, Maneuver prediction, Large language models, MobileViT
TL;DR: We propose 'ManeuverVLM', which uses visual and temporal encoders to extract embeddings from scene images and dynamic signals for driving maneuver prediction.
Abstract: Maneuver prediction in modern vehicles enhances safety by anticipating driver actions, enabling advanced driver assistance systems (ADAS) to provide proactive support and accident prevention. This research presents `ManeuverVLM', a vision-language model (VLM) that integrates scene images and dynamic signals for maneuver prediction. The model employs a vision encoder to extract spatial-visual embeddings, a temporal encoder for dynamic signals, and a large language model (LLM) for maneuver classification. We evaluated the proposed model on our collected driving dataset, which covers five maneuvers: straight, left/right turn, and left/right lane change. Experimental results demonstrate that ManeuverVLM with T5-mini achieves superior performance, reaching 99\% micro-accuracy, 98\% macro-accuracy, and a 97\% macro F1-score on this dataset. Notably, ManeuverVLM effectively handles challenging minority maneuvers, such as turning and lane changing, outperforming both temporal-only and spatial-temporal models that do not integrate an LLM. The proposed model, with 36.5 million parameters, 61.2 giga floating-point operations (GFLOPs), and a memory footprint of only 163 MB, is deployable on compact embedded processors in the vehicle.
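The abstract outlines a two-stream architecture (vision encoder for scene images, temporal encoder for dynamic signals, LLM-based classifier). The sketch below is a minimal, illustrative PyTorch reconstruction of that structure, not the authors' code: the simple convolutional stack stands in for MobileViT, the GRU stands in for the paper's temporal encoder, a small transformer stands in for T5-mini, and all module names, dimensions, and signal counts are assumptions.

```python
# Minimal architectural sketch of a ManeuverVLM-style model (illustrative only).
import torch
import torch.nn as nn


class ManeuverVLMSketch(nn.Module):
    def __init__(self, num_maneuvers: int = 5, embed_dim: int = 256):
        super().__init__()
        # Spatial-visual embeddings from scene images (placeholder for MobileViT).
        self.vision_encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(32, embed_dim),
        )
        # Temporal embeddings from dynamic signals (e.g. speed, steering angle);
        # 8 input channels is an assumed signal count.
        self.temporal_encoder = nn.GRU(
            input_size=8, hidden_size=embed_dim, batch_first=True
        )
        # Lightweight transformer standing in for the T5-mini language model.
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=4, batch_first=True
        )
        self.fusion = nn.TransformerEncoder(encoder_layer, num_layers=2)
        # Maneuver head: straight, left/right turn, left/right lane change.
        self.classifier = nn.Linear(embed_dim, num_maneuvers)

    def forward(self, images: torch.Tensor, signals: torch.Tensor) -> torch.Tensor:
        # images: (batch, 3, H, W); signals: (batch, time, 8)
        vis = self.vision_encoder(images).unsqueeze(1)        # (batch, 1, embed_dim)
        _, temp = self.temporal_encoder(signals)              # (1, batch, embed_dim)
        temp = temp.transpose(0, 1)                           # (batch, 1, embed_dim)
        fused = self.fusion(torch.cat([vis, temp], dim=1))    # (batch, 2, embed_dim)
        return self.classifier(fused.mean(dim=1))             # (batch, num_maneuvers)


if __name__ == "__main__":
    model = ManeuverVLMSketch()
    logits = model(torch.randn(2, 3, 224, 224), torch.randn(2, 20, 8))
    print(logits.shape)  # torch.Size([2, 5])
```

In the paper, the classifier would be the T5-mini LLM operating on the fused visual and temporal embeddings; the stub above only mirrors the data flow so the five-class maneuver prediction setup is concrete.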
Submission Number: 9