STPM: Spatial-Temporal Token Pruning and Merging for Complex Activity Recognition

Published: 01 Jan 2025 · Last Modified: 03 Aug 2025 · IEEE Trans. Circuits Syst. Video Technol. 2025 · CC BY-SA 4.0
Abstract: Lightweight video representation techniques have advanced significantly for simple activity recognition, but they still encounter several issues when applied to complex activity recognition: 1) The presence of numerous individuals at varying spatial positions makes it difficult for traditional token pruning methods to maintain accuracy. 2) Simply discarding entire frames may result in the loss of crucial cues. 3) Applying the same pruning rate to every frame to preserve parallel computation leaves significant redundancy in frames with little information. To this end, we propose a lightweight and novel Spatial-Temporal Token Pruning and Merging (STPM) framework, specifically designed for complex action videos in which human actors occupy only a small spatial region of each frame. Our framework considers two critical factors, semantic importance and spatial-temporal redundancy, to further reduce overhead. For semantic importance, STPM captures class-specific attention scores by learning multiple class tokens within the transformer to guide token pruning. For spatial-temporal redundancy, STPM employs an anchor graph and temporal attention to perform spatial and temporal token merging, preserving appearance and temporal cues while eliminating semantic duplication and redundancy. We conduct extensive experiments on JRDB-PAR using recently introduced video transformer backbones, e.g., MViT and ViT. Our framework achieves comparable accuracy while requiring 40% less computation.
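The abstract describes token pruning guided by class-token attention scores, combined with merging so that discarded content is not lost outright. The sketch below is a minimal illustration of that general idea, not the authors' implementation: it assumes a ViT-style layer where per-patch attention scores from the class token(s) are already available, and the names `prune_and_merge` and `keep_ratio` are hypothetical.

```python
# Minimal sketch (assumed, not from the paper): keep the patch tokens most
# attended to by the class token(s) and fuse the remainder into one
# attention-weighted summary token, so weak but non-empty cues survive.
import torch


def prune_and_merge(tokens: torch.Tensor,
                    cls_attn: torch.Tensor,
                    keep_ratio: float = 0.6) -> torch.Tensor:
    """tokens:   (B, N, C) patch tokens (class tokens excluded).
    cls_attn: (B, N) attention from the class token(s) to each patch,
              e.g. averaged over heads and over multiple class tokens.
    """
    B, N, C = tokens.shape
    n_keep = max(1, int(N * keep_ratio))

    # Rank patches by class-token attention and keep the top-k.
    idx = cls_attn.argsort(dim=1, descending=True)            # (B, N)
    keep_idx, drop_idx = idx[:, :n_keep], idx[:, n_keep:]

    keep = tokens.gather(1, keep_idx.unsqueeze(-1).expand(-1, -1, C))
    drop = tokens.gather(1, drop_idx.unsqueeze(-1).expand(-1, -1, C))
    drop_w = cls_attn.gather(1, drop_idx).unsqueeze(-1)        # (B, N-k, 1)

    # Merge low-score tokens into a single weighted summary token
    # instead of discarding them.
    merged = (drop * drop_w).sum(dim=1, keepdim=True) / (
        drop_w.sum(dim=1, keepdim=True) + 1e-6)
    return torch.cat([keep, merged], dim=1)                    # (B, n_keep + 1, C)


if __name__ == "__main__":
    x = torch.randn(2, 196, 768)      # patch tokens from one frame
    scores = torch.rand(2, 196)       # class-token attention per patch
    print(prune_and_merge(x, scores, keep_ratio=0.5).shape)  # (2, 99, 768)
```

The paper's actual spatial merging uses an anchor graph and its temporal merging uses temporal attention with frame-adaptive rates; the weighted-average fusion above only stands in for that step under the stated assumptions.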