Abstract: Distracted driving remains a critical threat to
road safety, often leading to severe accidents and fatalities.
Traditional driver monitoring solutions, including CNNs and
standard Vision Transformers, frequently fail to capture
subtle, spatially localized cues that signal driver distraction.
This paper presents Anchor-ViT, a novel Vision Transformer
architecture that integrates learnable spatial anchors with a
Soft Radial Attention (SRA) mechanism to adaptively focus on
driver-critical areas. These anchors are optimized via gradient
descent to guide attention toward relevant patches, while
SRA employs a Gaussian kernel to reinforce local interactions
and preserve global context through a dedicated class token.
Evaluations on the State Farm and 100-Driver distracted
driving datasets show that Anchor-ViT outperforms baseline
ViT models by up to 5.2% in accuracy, effectively balancing
the need for localized sensitivity and comprehensive scene
understanding. This design holds promise for enhancing
driver-monitoring systems and improving overall road safety.
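The core idea described above — biasing attention toward patches near a spatial anchor with a Gaussian kernel — can be sketched in a few lines. This is a minimal illustrative NumPy sketch, not the paper's implementation: the function name, the fixed (non-learned) anchor, and the log-domain bias are assumptions made for brevity; in Anchor-ViT the anchors are learnable parameters updated by gradient descent.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def soft_radial_attention(q, k, v, positions, anchor, sigma=1.0):
    """Single-head attention whose logits are biased by a Gaussian
    kernel centred on a spatial anchor (hypothetical sketch).

    positions: (N, 2) patch-grid coordinates; anchor: (2,) anchor location.
    """
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)                 # (N, N) scaled dot-product scores
    dist2 = ((positions - anchor) ** 2).sum(-1)   # squared distance of each patch to the anchor
    radial = np.exp(-dist2 / (2 * sigma ** 2))    # Gaussian kernel weight per patch
    logits = logits + np.log(radial + 1e-9)       # up-weight keys near the anchor
    return softmax(logits, axis=-1) @ v           # (N, d) attended values

rng = np.random.default_rng(0)
N, d = 16, 8                                      # 4x4 patch grid, toy embedding dim
q = rng.normal(size=(N, d))
k = rng.normal(size=(N, d))
v = rng.normal(size=(N, d))
pos = np.stack(np.meshgrid(np.arange(4), np.arange(4),
                           indexing="ij"), -1).reshape(-1, 2).astype(float)
out = soft_radial_attention(q, k, v, pos, anchor=np.array([1.5, 1.5]))
print(out.shape)  # (16, 8)
```

Because the radial term enters the logits additively before the softmax, attention over distant patches is suppressed smoothly rather than masked out, which matches the abstract's "soft" framing; global context would be preserved by exempting the class token from this bias.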