Abstract: Accurately recognizing distracted driving activities in real-world scenarios is essential for improving road and pedestrian safety. However, existing approaches are prone to attending to irrelevant scene context and are susceptible to interference from redundant frames, which compromises their robustness in complex driving environments. To overcome these limitations, we propose DualScope, a novel framework that captures behaviorally critical information from both spatial and temporal perspectives. In the spatial domain, we introduce a Synergistic Behavior-Centric Distillation mechanism that leverages two key information sources: (1) position-aware knowledge derived from the Segment Anything Model (SAM), which enhances the perception of critical regions and their semantic interaction structures; and (2) fine-grained visual details obtained from cropped key regions, which improve the model's ability to capture detailed patterns within behavior-relevant areas. In the temporal domain, we present the Saliency-Aware Fine-to-Coarse Temporal Modeling module, comprising three components: a Fine-Grained Motion Encoder that captures local inter-frame dependencies, a Dynamic Difference Extractor that generates salient motion dynamics, and a Saliency-Aware Temporal Pyramid Mamba that integrates these representations for multi-scale temporal modeling. This design captures both short-term motions and long-term behavioral patterns, while the incorporation of salient dynamics sharpens the model's focus on significant behavioral variations. Extensive experiments on seven publicly available distracted driving activity recognition (DDAR) datasets demonstrate that DualScope consistently outperforms state-of-the-art methods, validating its effectiveness in capturing behavioral cues across spatial and temporal dimensions.