Contrastive Single-Stream Spatio-Temporal Joint Modeling for Few-Shot Action Recognition

Published: 2025, Last Modified: 13 Nov 2025, Venue: ICMR 2025, License: CC BY-SA 4.0
Abstract: Prior work on few-shot action recognition predominantly adopts one of two strategies: spatio-temporally separated frame matching or multi-stream multi-modal networks. However, each suffers from a limitation: either incomplete spatio-temporal modeling or an over-reliance on additional annotation data. To address these limitations, we propose a Contrastive Single-Stream Spatio-Temporal joint modeling Few-Shot Action Recognition (CS3T-FSAR) model. In terms of spatio-temporal modeling, our approach directly constructs high-quality three-dimensional spatio-temporal representations to fully capture the global associations among video frames. Regarding the loss function design, we integrate a triplet loss to achieve precise matching while reducing both inference cost and computational complexity. Our method achieves significant performance improvements across four benchmark datasets, demonstrating its competitiveness in few-shot action recognition.
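The abstract mentions a triplet loss but gives no formulation. As a rough illustration only, the following is a minimal sketch of the standard margin-based triplet loss on plain feature vectors. The function names, the Euclidean distance, and the margin value are assumptions made for this example; they are not taken from the paper, and the CS3T-FSAR loss may differ in distance metric, triplet sampling, and how it is combined with other terms.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard margin-based triplet loss (hypothetical helper, not the paper's code).

    Pushes the anchor at least `margin` closer to the positive (same class)
    than to the negative (different class). The loss is zero once that
    separation is reached.
    """
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Toy embeddings standing in for video-level features (assumed values).
anchor = [0.0, 0.0]
positive = [0.0, 1.0]        # same action class as the anchor
far_negative = [3.0, 4.0]    # different class, already well separated
near_negative = [0.0, 1.5]   # different class, too close to the anchor

print(triplet_loss(anchor, positive, far_negative))
print(triplet_loss(anchor, positive, near_negative))
```

In this toy setup the well-separated negative contributes no loss, while the nearby negative is penalized, which is the behavior that encourages same-class videos to cluster in the embedding space.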