Human-Centric Context and Self-Uncertainty-Driven Multi-Modal Large Language Model for Training-Free Vision-Based Driver State Recognition

Chuanfei Hu, Xinde Li

Published: 01 Jan 2025 · Last Modified: 05 Nov 2025 · IEEE Transactions on Intelligent Transportation Systems · CC BY-SA 4.0
Abstract: A vision-driven driver monitoring system plays a vital role in guaranteeing driving safety. Recent advances focus on learning-based methods for driver monitoring, benefiting from the powerful capability of data-driven feature extraction. Although these methods achieve acceptable performance, training on massive data significantly increases labor costs. It is therefore natural to explore training-free vision-based driver state recognition in the era of large language models (LLMs) and multi-modal large language models (MLLMs). Two issues should be considered. First, a general prompt might not guide the MLLM to focus on human-centric visual appearances, resulting in an insufficient understanding of the driver's contextual cues. Second, the inherent uncertainty of the MLLM might impact reasoning precision, which has not been considered comprehensively in MLLM-based driver state recognition. In this paper, we propose a novel training-free driver state recognition method via a human-centric context and self-uncertainty-driven MLLM (HSUM). Specifically, a human-centric context generator (HCG) is first proposed based on a context-specific prompt: the MLLM is guided to capture human-centric contextual cues as a scene graph, so that the contextual interaction of objects with their surroundings can be represented effectively. Then, a self-uncertainty response enumerator (SRE) is proposed to exploit the uncertainty of the MLLM, repeatedly enumerating potential reasoning responses from the assembly of the human-centric context and an uncertainty-specific prompt. Furthermore, to reveal the precise reasoning result from the enumerated responses, we introduce the Dempster-Shafer evidence theory (DST)-based combination rule to conduct evidence-aware fusion (EAF), from which the precise answer can be gathered on a theoretical basis. Extensive experiments are conducted on two public benchmarks, where HSUM demonstrates competitive performance compared with state-of-the-art training-based methods: the accuracy and F1 score for dangerous states are 61.74% and 57.60%, with gaps of 4.43% and 6.85% relative to training-based methods. The code will be made public at https://github.com/w64228013/HSUM
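
To make the evidence-aware fusion step concrete, the following is a minimal sketch of how repeatedly enumerated MLLM responses could be fused with Dempster's rule of combination. It is not the authors' implementation: the state labels, the `response_to_mass` conversion (confidence on the predicted singleton, remaining mass on the whole frame as ignorance), and the confidence values are all illustrative assumptions.

```python
from itertools import product

def dempster_combine(m1, m2):
    """Combine two mass functions (dict: frozenset -> mass) with Dempster's rule."""
    fused, conflict = {}, 0.0
    for (a, ma), (b, mb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            fused[inter] = fused.get(inter, 0.0) + ma * mb
        else:
            conflict += ma * mb  # mass assigned to contradictory hypotheses
    if conflict >= 1.0:
        raise ValueError("Totally conflicting evidence; Dempster's rule is undefined.")
    # Normalize by the non-conflicting mass.
    return {s: v / (1.0 - conflict) for s, v in fused.items()}

# Hypothetical frame of discernment: candidate driver states.
FRAME = frozenset({"safe_driving", "texting", "drowsy", "drinking"})

def response_to_mass(predicted_state, confidence):
    """Turn one enumerated MLLM response into a simple mass function:
    `confidence` on the predicted singleton, the rest on the full frame (ignorance)."""
    return {frozenset({predicted_state}): confidence, FRAME: 1.0 - confidence}

# Example: three enumerated responses for the same driver image (assumed values).
responses = [("texting", 0.6), ("texting", 0.5), ("drowsy", 0.4)]
masses = [response_to_mass(state, conf) for state, conf in responses]

fused = masses[0]
for m in masses[1:]:
    fused = dempster_combine(fused, m)

# Final decision: the singleton hypothesis with the largest fused mass.
singletons = {next(iter(s)): v for s, v in fused.items() if len(s) == 1}
print(max(singletons, key=singletons.get), singletons)
```

Under these assumptions, the two "texting" responses reinforce each other while the conflicting "drowsy" response is discounted during normalization, so the fused decision is "texting"; this mirrors, in simplified form, how DST-based combination can resolve disagreement among enumerated responses.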