Multi-View Spatial-Temporal Learning for Understanding Unusual Behaviors in Untrimmed Naturalistic Driving Videos
Abstract: The task of Naturalistic Driving Action Recognition aims to detect and temporally localize distracting driving behaviors in untrimmed videos. In this paper, we introduce our framework for Track 3 of the 8th AI City Challenge in 2024. The approach is primarily based on large-model fine-tuning and ensemble techniques to train a set of action recognition models on a small-scale dataset. Starting with the raw videos, we segment them into individual action sequences based on their annotations. We then fine-tune four different action recognition models, applying K-fold cross-validation to the segmented data. Next, we perform a multi-view ensemble, selecting the most visible camera view for each action class to generate clip-level classification results for each video. Finally, a multi-step post-processing algorithm, designed around the specific characteristics of the AI City Challenge dataset, performs temporal action localization and produces temporal segments for the actions. Our solution achieves a final mOS score of 0.7798, ranking 5th on the public leaderboard for test set A2 of the challenge. The source code will be publicly available at https://github.com/SKKUAutoLab/AIC24-Track03.
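The multi-view ensemble described above can be illustrated with a minimal sketch: for each action class, the prediction is taken from the camera view in which that class is assumed to be most visible, and the fused scores yield the clip-level label. The view names, class-to-view mapping, and function name here are illustrative assumptions, not the authors' actual implementation.

```python
import numpy as np

def multi_view_ensemble(view_probs, view_for_class):
    """Fuse per-view class probabilities by selecting, for each action
    class, the score from the camera view where that class is most
    visible (hypothetical sketch of a view-selection ensemble).

    view_probs: dict mapping view name -> (num_classes,) probability array
    view_for_class: list of length num_classes mapping each class index
                    to its preferred view name
    """
    num_classes = len(view_for_class)
    fused = np.zeros(num_classes)
    for c, view in enumerate(view_for_class):
        fused[c] = view_probs[view][c]
    # Clip-level prediction is the class with the highest fused score.
    return int(np.argmax(fused)), fused

# Example with two assumed views and three assumed classes.
probs = {
    "dashboard": np.array([0.7, 0.1, 0.2]),
    "rearview":  np.array([0.2, 0.6, 0.2]),
}
mapping = ["dashboard", "rearview", "dashboard"]
pred, fused = multi_view_ensemble(probs, mapping)
```

In this toy example the fused scores are `[0.7, 0.6, 0.2]`, so class 0 is predicted for the clip; the real system would apply this per clip before the temporal post-processing step.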