DilatedTAD: Enhancing Adaptability to Actions of Varying Durations for Temporal Action Detection

Longyang Tang, Bo Zhang, Hui Lv, Rui Xu, Xudong Tian, Junsheng Zhou, Yi Chen

Published: 2026, Last Modified: 01 Apr 2026. IEEE Trans. Circuits Syst. Video Technol. 2026. CC BY-SA 4.0
Abstract: Temporal Action Detection (TAD) aims to identify action boundaries and their corresponding categories in untrimmed videos, and plays a crucial role in long-video understanding. Prior works often struggle to balance the capture of long-range dependencies against computational efficiency. Recently, the state space model Mamba has exhibited impressive capability and efficiency in long-sequence modeling. However, current Mamba-based methods generally lack a unified framework that simultaneously addresses the redundancy of long-duration actions and the boundary sensitivity of short-duration actions. These limitations largely stem from Mamba’s reliance on limited state representations and its unidirectional modeling. To address these challenges, we propose DilatedTAD, a novel TAD framework with an expanded receptive field. DilatedTAD leverages the Inter-Parallel DIM component (InterDIM) to integrate multi-scale temporal information, enabling a better trade-off between short-duration and long-duration action detection. InterDIM is built upon our proposed Dilated Mamba (DIM): multiple DIM branches with different dilation rates are designed to focus on actions of varying durations. Specifically, DIM uses dilation to skip redundant temporal information, thereby sharpening the model’s focus on crucial boundary features. In addition, DIM adopts a bidirectional modeling design to compensate for the lack of future temporal context in the original Mamba architecture. Extensive experiments show that DilatedTAD outperforms state-of-the-art methods on multiple datasets, achieving mAPs of 74.9% (THUMOS14), 42.90% (ActivityNet 1.3), 45.0% (HACS), and 26.3% and 24.3% (EPIC-Kitchens 100). Our code will be publicly available.
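The idea of parallel branches with different dilation rates and bidirectional scanning can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify how DIM applies dilation, so the interleaved split below is an assumption, and the exponential-moving-average recurrence is only a stand-in for a Mamba block. All function names (`dilated_split`, `dilated_merge`, `causal_scan`, `bidirectional_scan`, `dilated_branch`, `multi_branch`) and the `decay` parameter are hypothetical.

```python
def dilated_split(seq, rate):
    """Split seq into `rate` interleaved subsequences (stride = rate)."""
    return [seq[offset::rate] for offset in range(rate)]


def dilated_merge(parts, length):
    """Inverse of dilated_split: interleave the subsequences back."""
    rate = len(parts)
    out = [None] * length
    for offset, part in enumerate(parts):
        out[offset::rate] = part
    return out


def causal_scan(seq, decay=0.5):
    """Toy linear recurrence h_t = decay * h_{t-1} + (1 - decay) * x_t.

    Assumption: a stand-in for a state space block, not Mamba itself.
    """
    state = 0.0
    out = []
    for x in seq:
        state = decay * state + (1.0 - decay) * x
        out.append(state)
    return out


def bidirectional_scan(seq, decay=0.5):
    """Average a forward scan and a backward scan, so each step sees
    both past and future context."""
    forward = causal_scan(seq, decay)
    backward = causal_scan(seq[::-1], decay)[::-1]
    return [(f + b) / 2.0 for f, b in zip(forward, backward)]


def dilated_branch(seq, rate, decay=0.5):
    """Scan each dilated subsequence independently, then re-interleave."""
    parts = [bidirectional_scan(p, decay) for p in dilated_split(seq, rate)]
    return dilated_merge(parts, len(seq))


def multi_branch(seq, rates=(1, 2, 4), decay=0.5):
    """Run branches with different dilation rates in parallel and
    average them (the fusion rule here is an assumption)."""
    branches = [dilated_branch(seq, r, decay) for r in rates]
    return [sum(vals) / len(vals) for vals in zip(*branches)]


if __name__ == "__main__":
    features = [0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0]
    print(multi_branch(features))
```

In this sketch a branch with dilation rate `r` relates each time step only to steps at multiples of `r` away, so larger rates cover a longer temporal span with the same number of recurrence steps, while rate 1 retains fine-grained detail around boundaries.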