Keywords: video action recognition, motion feature learning, space-time self-similarity, higher-order similarity
TL;DR: We introduce the multi-order self-similarity (MOSS) module, designed to learn and integrate multi-order space-time self-similarity features that model diverse aspects of spatio-temporal dynamics in videos.
Abstract: Space-time self-similarity (STSS), which captures visual correspondences across frames, provides an effective way to represent temporal dynamics for video understanding. In this work, we propose higher-order STSS and demonstrate how STSS at different orders reveals distinct aspects of these dynamics. We then introduce the multi-order self-similarity (MOSS) module, a lightweight neural module designed to learn and integrate multi-order STSS features; it can be readily applied to video classification architectures to enhance their motion modeling capabilities while incurring only marginal computation and memory overhead. Evaluated on the Kinetics-400 and Something-Something V1 & V2 benchmarks, our method achieves strong performance and the best memory-accuracy trade-off compared to state-of-the-art approaches. Source code and model checkpoints will be made publicly available.
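To make the notion of space-time self-similarity concrete, below is a minimal sketch of a first-order STSS computation: cosine similarity between each frame's features and spatially shifted features of the next frame over a small offset window. The function name `first_order_stss`, the `max_offset` parameter, and the roll-based shifting are illustrative assumptions; this is not the authors' MOSS implementation, which learns and fuses similarities at multiple orders.

```python
# Illustrative sketch of first-order space-time self-similarity (STSS),
# assuming video features of shape (B, C, T, H, W). Not the MOSS module itself.
import torch
import torch.nn.functional as F

def first_order_stss(feat: torch.Tensor, max_offset: int = 3) -> torch.Tensor:
    """Cosine similarity between frame t and spatially shifted frame t+1
    over a (2*max_offset+1)^2 window of spatial offsets."""
    B, C, T, H, W = feat.shape
    feat = F.normalize(feat, dim=1)            # unit-norm channel vectors
    cur = feat[:, :, :-1]                      # frames t = 0 .. T-2
    nxt = feat[:, :, 1:]                       # frames t = 1 .. T-1
    sims = []
    for dy in range(-max_offset, max_offset + 1):
        for dx in range(-max_offset, max_offset + 1):
            # circular shift is a simplification; boundary handling may differ
            shifted = torch.roll(nxt, shifts=(dy, dx), dims=(3, 4))
            sims.append((cur * shifted).sum(dim=1))   # (B, T-1, H, W)
    # STSS volume: one similarity map per spatial offset
    return torch.stack(sims, dim=1)            # (B, (2k+1)^2, T-1, H, W)
```

Higher-order variants would, roughly speaking, compute self-similarities of such similarity features themselves; the exact formulation is given in the paper.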
Supplementary Material: zip
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 15816