Abstract: Temporal Action Localization (TAL) is a crucial task in the field of video understanding. Previous research in computer vision, including architectures such as multi-scale feature pyramids, transformers, and anchor-free methods, has aimed to enhance performance on TAL tasks. However, these methods have been ineffective at extracting and learning features from videos and human actions, resulting in unsatisfactory performance. Recently, Mamba, built on the State Space Model (SSM) architecture, has emerged and led to the development of Video Mamba, marking a major advancement in video understanding. In this work, we propose a novel TAL model by integrating the SSM with an attention mechanism called Efficient Temporal Attention (ETA) to form the SS-ETA module. This approach leverages the efficient and accurate temporal sequence feature extraction capabilities of the SSM architecture and incorporates the strengths of the attention mechanism in capturing important features and long-range dependencies, addressing the limitations of existing methods in temporal feature extraction and model scalability, and significantly improving performance on TAL tasks. The substantial improvements achieved by our method have been validated through experiments across various datasets, demonstrating the effectiveness and superiority of our approach in enhancing TAL performance.