A Multi-Modal Architecture With Spatio-Temporal-Text Adaptation for Video-Based Traffic Accident Anticipation
Abstract: Early and precise accident anticipation is critical for preventing road traffic incidents in advanced traffic systems. This paper presents a Multi-modal Architecture with Spatio-Temporal-Text Adaptation (MASTTA), featuring a Visual Encoder and a Text Encoder within a streamlined end-to-end framework for traffic accident anticipation. Both encoders leverage the CLIP model, pre-trained on large-scale text-image pairs, to exploit visual and textual information effectively. MASTTA captures complex traffic patterns and relationships by fine-tuning only lightweight adapters, leaving the pre-trained backbone frozen and thereby reducing retraining demands. In the Visual Encoder, spatio-temporal adaptation is achieved through a novel Temporal Adapter, a novel Spatial Adapter, and an MLP Adapter. The Temporal Adapter enhances temporal consistency in accident-prone areas, while the Spatial Adapter captures spatio-temporal interactions among visual cues. The Text Encoder, equipped with a Text Adapter and an MLP Adapter, aligns latent textual and visual features in a joint embedding space, refining the semantic representation. This synergy of text and visual adapters enables MASTTA to model complex spatial interactions across long-range temporal context, improving accident anticipation. We validate MASTTA on the DAD and CCD datasets, demonstrating significant improvements in both earliness and correctness over state-of-the-art methods.
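The adapter-based fine-tuning strategy described above can be illustrated with a minimal sketch. The snippet below shows a generic bottleneck "MLP adapter" with a residual connection, the common pattern for parameter-efficient adaptation of a frozen encoder; all names, dimensions, and initializations here are illustrative assumptions, not details taken from the MASTTA paper.

```python
import numpy as np

# Hypothetical sketch of a bottleneck MLP adapter with a residual
# connection, as commonly used to adapt a frozen CLIP encoder.
# Only the small down/up projection matrices would be trained;
# the backbone features x come from the frozen encoder.

def mlp_adapter(x, w_down, w_up):
    """Down-project, apply ReLU, up-project, then add the residual."""
    h = np.maximum(x @ w_down, 0.0)   # bottleneck activation (dim r << d)
    return x + h @ w_up               # residual preserves the frozen feature

rng = np.random.default_rng(0)
d, r = 512, 64                        # illustrative feature and bottleneck dims
x = rng.standard_normal((8, d))       # e.g. 8 frame tokens from a frozen encoder
w_down = rng.standard_normal((d, r)) * 0.02
w_up = rng.standard_normal((r, d)) * 0.02  # near-zero init keeps output close to x

y = mlp_adapter(x, w_down, w_up)
print(y.shape)  # (8, 512): adapter output matches the encoder feature shape
```

Because the up-projection is initialized near zero, the adapter output starts close to the frozen feature, so training only the adapter weights perturbs the pre-trained representation gradually rather than overwriting it.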
External IDs: dblp:journals/tcsv/PateraCF25