Weakly-Supervised Temporal Action Localization with Multi-Modal Plateau Transformers

Published: 01 Jan 2024 · Last Modified: 17 Apr 2025 · CVPR Workshops 2024 · CC BY-SA 4.0
Abstract: Weakly-Supervised Temporal Action Localization (WS-TAL) aims to jointly localize and classify action segments in untrimmed videos using only video-level annotations. To leverage such annotations, most existing methods adopt the multiple instance learning paradigm, in which frame- or snippet-level action predictions are first produced and then aggregated into a video-level prediction. Although prior work has attempted to improve snippet-level predictions by modeling temporal relationships, we argue that these implementations have not sufficiently exploited such information. In this paper, we propose Multi-Modal Plateau Transformers (M²PT) for WS-TAL, which simultaneously exploit temporal relationships among snippets, complementary information across data modalities, and temporal coherence among consecutive snippets. Specifically, M²PT adopts a dual-Transformer architecture for the RGB and optical flow modalities, modeling intra-modality temporal relationships with a self-attention mechanism and inter-modality temporal relationships with a cross-attention mechanism. To capture the temporal coherence whereby consecutive snippets should be assigned the same action, M²PT deploys a Plateau model to refine the temporal localization of action segments. Experimental results on popular benchmarks demonstrate that M²PT achieves state-of-the-art performance.
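The abstract describes two mechanisms: a dual-Transformer that applies self-attention within each modality and cross-attention across the RGB and optical-flow streams, and a plateau-shaped temporal model that encourages consecutive snippets to share the same action label. The sketch below illustrates both ideas in PyTorch. It is a minimal interpretation of the abstract, not the authors' released code: all module names, feature dimensions, and the product-of-sigmoids plateau parameterization are our assumptions.

```python
# Minimal PyTorch sketch of the dual-Transformer and plateau ideas in the
# abstract. Names, dimensions, and parameterizations are illustrative
# assumptions, not the authors' implementation.
import torch
import torch.nn as nn


class DualModalityBlock(nn.Module):
    """One block: self-attention within each modality, then cross-attention
    where each modality queries the other (RGB <-> optical flow)."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.self_rgb = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_flow = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_rgb = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_flow = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_rgb = nn.LayerNorm(dim)
        self.norm_flow = nn.LayerNorm(dim)

    def forward(self, rgb: torch.Tensor, flow: torch.Tensor):
        # rgb, flow: (batch, num_snippets, dim) snippet features.
        # Intra-modality temporal relationships via self-attention.
        rgb = rgb + self.self_rgb(rgb, rgb, rgb)[0]
        flow = flow + self.self_flow(flow, flow, flow)[0]
        # Inter-modality relationships via cross-attention: each stream
        # uses its own snippets as queries and the other stream as keys/values.
        rgb2 = rgb + self.cross_rgb(rgb, flow, flow)[0]
        flow2 = flow + self.cross_flow(flow, rgb, rgb)[0]
        return self.norm_rgb(rgb2), self.norm_flow(flow2)


def plateau(t: torch.Tensor, center: torch.Tensor, width: torch.Tensor,
            slope: float = 10.0) -> torch.Tensor:
    """Plateau-shaped temporal weight: ~1 inside [center-width, center+width],
    smoothly decaying outside, so neighboring snippets get similar scores.
    One common parameterization (product of two sigmoids); the paper's exact
    form may differ."""
    rise = torch.sigmoid(slope * (t - (center - width)))
    fall = torch.sigmoid(slope * ((center + width) - t))
    return rise * fall


if __name__ == "__main__":
    block = DualModalityBlock()
    rgb = torch.randn(2, 100, 256)   # e.g. RGB snippet features
    flow = torch.randn(2, 100, 256)  # e.g. optical-flow snippet features
    r, f = block(rgb, flow)
    print(r.shape, f.shape)          # torch.Size([2, 100, 256]) twice

    t = torch.arange(100).float()    # snippet indices along the timeline
    w = plateau(t, torch.tensor(50.0), torch.tensor(10.0))
    print(w.argmax().item())         # peak lies inside the plateau region
```

In this reading, the plateau weight would refine raw snippet-level attention or class scores so that a detected segment keeps a flat, near-uniform response over its interior and sharp boundaries at its edges, rather than isolated per-snippet peaks.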