Abstract: Traffic accident anticipation, the task of predicting an accident before it occurs, is crucial for autonomous vehicles. In this study, we introduce a novel approach that utilises Spatial and Temporal Adapters, specifically designed for image-to-video adaptation through parameter-efficient transfer learning (PEFTL), for traffic accident anticipation. To fully leverage the knowledge from a pretrained CLIP Vision Transformer (CLIP-ViT), the proposed architecture incorporates lightweight Dilated Neighbourhood Attention (DNA) within the Adapters. Furthermore, DNA is integrated with a cross-attention mechanism in the Temporal Adapter to capture long-range temporal dependencies. The combination of these Adapters significantly enhances spatio-temporal adaptation, addressing the limitations of existing methods in accurately identifying accident-prone areas while anticipating accidents early in an end-to-end manner. Extensive experiments on two widely used benchmark datasets, DAD and CCD, demonstrate notable performance improvements over state-of-the-art methods.