Abstract: Recent advances in video transformers have significantly impacted the field of human action recognition. Leveraging these models for distracted driver action recognition could substantially improve road safety measures and enhance Human-Machine Interaction (HMI) technologies. A factor that limits their adoption is the need for extensive data for model training. In this paper, we propose DRVMon-VM, a novel approach for the recognition of distracted driver actions. It builds on a large pre-trained video transformer, VideoMaeV2, as the backbone, with a classification head as decoder; both are fine-tuned using a dual learning rate strategy on a medium-sized driver action database, complemented by various data augmentation techniques. Our proposed model exhibits a substantial improvement, exceeding previous results by 7.34% on the challenging Drive&Act dataset, thereby setting a new benchmark in this field.
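To make the dual learning rate strategy concrete, the sketch below shows one common way to realize it in PyTorch: the pre-trained backbone and the newly initialized classification head are placed in separate optimizer parameter groups with different learning rates. This is a minimal illustration under assumed module names, learning rates, and optimizer choice, not the authors' exact configuration.

```python
# Minimal sketch of a dual learning rate fine-tuning setup (assumed details,
# not the DRVMon-VM implementation): the backbone keeps a small learning rate
# to preserve pre-trained features, while the classification head uses a
# larger one so it can adapt quickly.
import torch
import torch.nn as nn


class DriverActionClassifier(nn.Module):
    def __init__(self, backbone: nn.Module, feat_dim: int, num_classes: int):
        super().__init__()
        self.backbone = backbone                       # large pre-trained video transformer
        self.head = nn.Linear(feat_dim, num_classes)   # classification head (decoder)

    def forward(self, x):
        return self.head(self.backbone(x))


def build_optimizer(model: DriverActionClassifier,
                    lr_backbone: float = 1e-5,
                    lr_head: float = 1e-3,
                    weight_decay: float = 0.05) -> torch.optim.Optimizer:
    # Two parameter groups, one learning rate each.
    param_groups = [
        {"params": model.backbone.parameters(), "lr": lr_backbone},
        {"params": model.head.parameters(), "lr": lr_head},
    ]
    return torch.optim.AdamW(param_groups, weight_decay=weight_decay)
```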