Keywords: video action recognition, motion feature learning, space-time self-similarity, higher-order similarity
TL;DR: We introduce the multi-order self-similarity (MOSS) module, designed to learn and integrate multi-order space-time self-similarity features that model diverse aspects of spatio-temporal dynamics in videos.
Abstract: Space-time self-similarity (STSS), which captures visual correspondences across frames, provides an effective way to represent temporal dynamics for video understanding. In this work, we propose higher-order STSS and demonstrate how STSS at different orders reveals distinct aspects of these dynamics. We then introduce the multi-order self-similarity (MOSS) module, a lightweight neural module designed to learn and integrate multi-order STSS features; it can be readily applied to video classification architectures to enhance their motion modeling capabilities while incurring only marginal computation and memory overhead. Evaluated on the Kinetics-400 and Something-Something V1 & V2 benchmarks, our method achieves strong performance and the best memory-accuracy trade-off compared to state-of-the-art approaches. Source code and model checkpoints will be made publicly available.
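To make the notion of space-time self-similarity concrete, below is a minimal sketch of a first-order STSS computation: cosine similarity between each frame's features and spatially shifted features of the next frame over a small offset window. The function name `first_order_stss`, the `max_offset` parameter, and the roll-based shifting are illustrative assumptions; this is not the authors' MOSS implementation, which learns and fuses similarities at multiple orders.

```python
# Illustrative sketch of first-order space-time self-similarity (STSS),
# assuming video features of shape (B, C, T, H, W). Not the MOSS module itself.
import torch
import torch.nn.functional as F

def first_order_stss(feat: torch.Tensor, max_offset: int = 3) -> torch.Tensor:
    """Cosine similarity between frame t and spatially shifted frame t+1
    over a (2*max_offset+1)^2 window of spatial offsets."""
    B, C, T, H, W = feat.shape
    feat = F.normalize(feat, dim=1)            # unit-norm channel vectors
    cur = feat[:, :, :-1]                      # frames t = 0 .. T-2
    nxt = feat[:, :, 1:]                       # frames t = 1 .. T-1
    sims = []
    for dy in range(-max_offset, max_offset + 1):
        for dx in range(-max_offset, max_offset + 1):
            # circular shift is a simplification; boundary handling may differ
            shifted = torch.roll(nxt, shifts=(dy, dx), dims=(3, 4))
            sims.append((cur * shifted).sum(dim=1))   # (B, T-1, H, W)
    # STSS volume: one similarity map per spatial offset
    return torch.stack(sims, dim=1)            # (B, (2k+1)^2, T-1, H, W)
```

Higher-order variants would, roughly speaking, compute self-similarities of such similarity features themselves; the exact formulation is given in the paper.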
Supplementary Material: zip
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 15816