A Study on the Effectiveness of GPT-4V in Classifying Driver Behavior Captured on Video Using Just a Few Frames per Video

Published: 01 Jan 2024 · Last Modified: 06 Jun 2025 · IJCNN 2024 · CC BY-SA 4.0
Abstract: This paper presents a study evaluating the effectiveness of GPT-4V vision processing in identifying risk events within driving scenarios. The scenarios are captured in a series of videos, and GPT-4V's analysis is restricted to only a few frames from each video. The study targets risk behaviors such as yawning, smoking, phone usage, and distraction from the road. It draws on a comprehensive collection of driver video recordings that were previously annotated by human evaluators to tag instances of these behaviors. Our methodology involves a detailed analysis of GPT-4V's performance, assessing its accuracy and consistency against human benchmarks on both a private and a public dataset. On the private dataset, GPT-4V identified yawning events with 98.9% accuracy, followed closely by 98.4% accuracy in detecting smoking. It achieved 91.7% accuracy in recognizing driver distraction, 95.7% accuracy for phone usage, and 94.1% accuracy for "Face Not Visible" events. On the public dataset, GPT-4V achieved 90.9% accuracy for the "Using Cellphone" category, with 76.6% recall and 92.1% precision. For "Distraction" events, it achieved 91.0% accuracy, with 93.1% recall and 97.4% precision. For "Yawning" events, it achieved 98.2% accuracy, although recall was lower at 43.7%, with 87.5% precision. These findings deepen our understanding of how multimodal foundation models can be applied to improve road safety and point to clear directions for future development of autonomous monitoring systems.
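
The core methodology described in the abstract — sampling a handful of frames from each driver video and prompting GPT-4V to label the behavior — can be illustrated with a short sketch. The snippet below is only an assumption of how such a pipeline might be wired up, using OpenCV for frame sampling and the OpenAI Python client for the vision call; the exact prompt wording, frame count, and label set used in the paper are not given in the abstract, so `FRAMES_PER_VIDEO`, `LABELS`, and the prompt text are hypothetical placeholders.

```python
import base64
import cv2  # OpenCV, for reading and sampling video frames
from openai import OpenAI

# Hypothetical parameters -- the paper's exact frame count and label set may differ.
FRAMES_PER_VIDEO = 4
LABELS = ["yawning", "smoking", "using cellphone", "distraction", "face not visible", "none"]

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def sample_frames(video_path: str, n_frames: int = FRAMES_PER_VIDEO) -> list[str]:
    """Uniformly sample n_frames from a video and return them as base64-encoded JPEGs."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    encoded = []
    for i in range(n_frames):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(i * total / n_frames))
        ok, frame = cap.read()
        if not ok:
            continue
        ok, buf = cv2.imencode(".jpg", frame)
        if ok:
            encoded.append(base64.b64encode(buf.tobytes()).decode("utf-8"))
    cap.release()
    return encoded


def classify_driver_behavior(video_path: str) -> str:
    """Send a few sampled frames to GPT-4V and ask for a single behavior label."""
    content = [
        {
            "type": "text",
            "text": (
                "These frames were sampled from a video of a driver. "
                f"Classify the driver's behavior as exactly one of: {', '.join(LABELS)}. "
                "Answer with the label only."
            ),
        }
    ]
    for frame_b64 in sample_frames(video_path):
        content.append(
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/jpeg;base64,{frame_b64}"},
            }
        )
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # GPT-4V model name at the time of the study
        messages=[{"role": "user", "content": content}],
        max_tokens=10,
    )
    return response.choices[0].message.content.strip().lower()


if __name__ == "__main__":
    print(classify_driver_behavior("driver_clip.mp4"))
```

The predicted label for each clip could then be compared against the human annotation to compute the per-category accuracy, precision, and recall figures reported above.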