Abstract: Distracted driving remains a critical threat to
road safety, often leading to severe accidents and fatalities.
Traditional driver monitoring solutions, including CNNs and
standard Vision Transformers, frequently fail to capture
subtle, spatially localized cues that signal driver distraction.
This paper presents Anchor-ViT, a novel Vision Transformer
architecture that integrates learnable spatial anchors with a
Soft Radial Attention (SRA) mechanism to adaptively focus on
driver-critical areas. These anchors are optimized via gradient
descent to guide attention toward relevant patches, while
SRA employs a Gaussian kernel to reinforce local interactions
and preserve global context through a dedicated class token.
Evaluations on the State Farm and 100-Driver distracted
driving datasets show that Anchor-ViT outperforms baseline
ViT models by up to 5.2% in accuracy, effectively balancing
the need for localized sensitivity and comprehensive scene
understanding. This design holds promise for enhancing
driver-monitoring systems and improving overall road safety.
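The core idea described above — biasing attention toward patches near a spatial anchor with a Gaussian kernel — can be sketched in a few lines. This is a minimal illustrative NumPy sketch, not the paper's implementation: the function name, the fixed (non-learned) anchor, and the log-domain bias are assumptions made for brevity; in Anchor-ViT the anchors are learnable parameters updated by gradient descent.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def soft_radial_attention(q, k, v, positions, anchor, sigma=1.0):
    """Single-head attention whose logits are biased by a Gaussian
    kernel centred on a spatial anchor (hypothetical sketch).

    positions: (N, 2) patch-grid coordinates; anchor: (2,) anchor location.
    """
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)                 # (N, N) scaled dot-product scores
    dist2 = ((positions - anchor) ** 2).sum(-1)   # squared distance of each patch to the anchor
    radial = np.exp(-dist2 / (2 * sigma ** 2))    # Gaussian kernel weight per patch
    logits = logits + np.log(radial + 1e-9)       # up-weight keys near the anchor
    return softmax(logits, axis=-1) @ v           # (N, d) attended values

rng = np.random.default_rng(0)
N, d = 16, 8                                      # 4x4 patch grid, toy embedding dim
q = rng.normal(size=(N, d))
k = rng.normal(size=(N, d))
v = rng.normal(size=(N, d))
pos = np.stack(np.meshgrid(np.arange(4), np.arange(4),
                           indexing="ij"), -1).reshape(-1, 2).astype(float)
out = soft_radial_attention(q, k, v, pos, anchor=np.array([1.5, 1.5]))
print(out.shape)  # (16, 8)
```

Because the radial term enters the logits additively before the softmax, attention over distant patches is suppressed smoothly rather than masked out, which matches the abstract's "soft" framing; global context would be preserved by exempting the class token from this bias.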