SOAP: Enhancing Spatio-Temporal Relation and Motion Information Capturing for Few-Shot Action Recognition

Published: 20 Jul 2024, Last Modified: 05 Aug 2024, MM2024 Poster, CC BY 4.0
Abstract: High frame-rate (HFR) videos for action recognition improve fine-grained expression while reducing the density of spatio-temporal relations and motion information. Traditional data-driven training therefore requires a continual supply of large numbers of video samples. However, samples are not always sufficient in real-world scenarios, which promotes few-shot action recognition (FSAR) research. We observe that most recent FSAR works build the spatio-temporal relations of video samples via temporal alignment after spatial feature extraction, cutting spatial and temporal features apart within samples. They also capture motion information from the narrow perspective of adjacent frames without considering information density, leading to insufficient motion capture. Therefore, in this paper we propose a novel plug-and-play architecture for FSAR called the $\underline{\textbf{S}}$patio-temp$\underline{\textbf{O}}$ral fr$\underline{\textbf{A}}$me tu$\underline{\textbf{P}}$le enhancer ($\textbf{SOAP}$). The model we design with this architecture is referred to as SOAP-Net. Instead of performing simple feature extraction, it considers temporal connections between different feature channels and the spatio-temporal relations of features. It also captures comprehensive motion information using frame tuples of multiple frames, which carry more motion information than adjacent frame pairs; combining frame tuples of different frame counts further provides a broader temporal perspective. SOAP-Net achieves new state-of-the-art performance across well-known benchmarks such as SthSthV2, Kinetics, UCF101, and HMDB51. Extensive empirical evaluations underscore the competitiveness, pluggability, generalization, and robustness of SOAP. The code will be released.
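To make the frame-tuple idea from the abstract concrete, below is a minimal PyTorch sketch, not the authors' released code: motion cues are summarized over sliding tuples of multiple frames rather than only adjacent pairs, and tuples of different frame counts are combined for a broader temporal perspective. The class name `FrameTupleMotion`, the tuple sizes, and the pooling choices are all hypothetical assumptions for illustration.

```python
# Illustrative sketch of frame-tuple motion capture (hypothetical design,
# not the authors' implementation).
import torch
import torch.nn as nn

class FrameTupleMotion(nn.Module):
    """Aggregates motion cues from frame tuples of several sizes."""
    def __init__(self, channels: int, tuple_sizes=(2, 3, 4)):
        super().__init__()
        self.tuple_sizes = tuple_sizes
        # One lightweight 3D conv per tuple size, spanning that many frames.
        self.branches = nn.ModuleList(
            nn.Conv3d(channels, channels, kernel_size=(k, 3, 3),
                      padding=(0, 1, 1))
            for k in tuple_sizes
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames, height, width)
        outputs = []
        for k, conv in zip(self.tuple_sizes, self.branches):
            # Each sliding window of k frames forms one frame tuple;
            # the conv summarizes motion within that tuple.
            motion = conv(x)                    # (B, C, T - k + 1, H, W)
            outputs.append(motion.mean(dim=2))  # pool over tuple positions
        # Combine perspectives from tuples of different frame counts.
        return torch.stack(outputs, dim=0).mean(dim=0)  # (B, C, H, W)

# Usage: backbone features, e.g. 8 frames of 256-channel 14x14 maps.
feats = torch.randn(2, 256, 8, 14, 14)
motion = FrameTupleMotion(256)(feats)
print(motion.shape)  # torch.Size([2, 256, 14, 14])
```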
Primary Subject Area: [Content] Media Interpretation
Secondary Subject Area: [Experience] Multimedia Applications
Relevance To Conference: In the field of multimedia, high frame-rate videos enhance fine-grained expression but reduce the density of spatio-temporal relations and motion information, requiring large numbers of video samples for data-driven training. Insufficient samples in real-world scenarios promote few-shot action recognition (FSAR) research. To address the challenges of building spatio-temporal relations and capturing comprehensive motion information, we propose a plug-and-play architecture, the $\underline{\textbf{S}}$patio-temp$\underline{\textbf{O}}$ral fr$\underline{\textbf{A}}$me tu$\underline{\textbf{P}}$le enhancer ($\textbf{SOAP}$), for FSAR. SOAP considers temporal connections between different feature channels and the spatio-temporal relations of features, and captures comprehensive motion information from a broader perspective. The model we design with this architecture, SOAP-Net, achieves state-of-the-art performance across benchmarks such as SthSthV2, Kinetics, UCF101, and HMDB51. Furthermore, SOAP integrates seamlessly with different multimodal methods and brings benefits in our experiments. Given the competitiveness, pluggability, generalization, and robustness of SOAP, we hope and believe that this work will promote future research in multimedia, especially the development of media interpretation and multimedia applications.
Supplementary Material: zip
Submission Number: 2360