IACFormer: a transformer framework with instantaneous average convolution for temporal action detection
Abstract: Temporal Action Detection (TAD) in video understanding involves action detection and boundary localization, and accurately pinpointing the start and end times of action instances remains a major challenge. Although existing Transformer-based methods such as ActionFormer achieve strong detection performance, they capture local details poorly and are not well suited to precise temporal localization: the similarity of adjacent frames and the use of a single regression head lead to inaccurate boundary localization. To address these issues, we propose IACFormer, which makes four key improvements. First, an Instantaneous Average Convolution Module (IACM) and an Instantaneous Average Dilated Convolution Module (IADCM), combined with multi-head global self-attention blocks, enhance local feature learning. Second, nonlinear branches within the IACM and IADCM sharpen the distinction between adjacent frame features. Third, dilated convolutions in the deep IADCM layers promote deep feature learning. Fourth, separate start and end boundary regression heads on the detection head enable accurate localization of start and end times. In addition, through knowledge distillation, IACFormer substantially reduces GPU memory overhead on THUMOS14, ActivityNet 1.3, and EPIC-Kitchens 100 without sacrificing average precision or recall. IACFormer outperforms the baselines on THUMOS14, ActivityNet 1.3, HACS, and EPIC-Kitchens 100, achieving state-of-the-art results on THUMOS14 and EPIC-Kitchens 100.
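To make the abstract's architectural description concrete, the following is a minimal, hypothetical sketch (not the authors' code) of the ideas it names: a local-averaging convolution branch with a nonlinear branch feeding a multi-head global self-attention block, and separate start/end regression heads on the detection head. All module names, shapes, and hyperparameters here are illustrative assumptions.

```python
import torch
import torch.nn as nn


class InstantAvgConvBlock(nn.Module):
    """Local branch: learnable averaging-style temporal conv plus a nonlinear branch (assumed form)."""

    def __init__(self, dim: int, kernel_size: int = 3, dilation: int = 1):
        super().__init__()
        pad = dilation * (kernel_size - 1) // 2
        # Depth-wise temporal convolution acting as a learnable local average.
        self.local_avg = nn.Conv1d(dim, dim, kernel_size, padding=pad,
                                   dilation=dilation, groups=dim)
        # Nonlinear branch intended to sharpen near-identical adjacent frames.
        self.nonlinear = nn.Sequential(nn.Conv1d(dim, dim, 1), nn.GELU())
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                       # x: (B, T, C)
        y = x.transpose(1, 2)                   # (B, C, T) for Conv1d
        y = self.local_avg(y) + self.nonlinear(y)
        return self.norm(x + y.transpose(1, 2))


class LocalGlobalLayer(nn.Module):
    """Local conv block followed by global multi-head self-attention."""

    def __init__(self, dim: int = 256, heads: int = 4, dilation: int = 1):
        super().__init__()
        self.local = InstantAvgConvBlock(dim, dilation=dilation)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        x = self.local(x)
        a, _ = self.attn(x, x, x)
        return self.norm(x + a)


class BoundaryHeads(nn.Module):
    """Classification head plus separate start and end boundary regression heads."""

    def __init__(self, dim: int = 256, num_classes: int = 20):
        super().__init__()
        self.cls = nn.Conv1d(dim, num_classes, 1)
        self.start = nn.Conv1d(dim, 1, 1)       # per-step offset to the action start
        self.end = nn.Conv1d(dim, 1, 1)         # per-step offset to the action end

    def forward(self, x):                       # x: (B, T, C)
        y = x.transpose(1, 2)
        return self.cls(y), self.start(y).squeeze(1), self.end(y).squeeze(1)


if __name__ == "__main__":
    feats = torch.randn(2, 128, 256)            # (batch, time steps, feature dim)
    layer = LocalGlobalLayer(dim=256, dilation=2)   # deeper layers could use larger dilation
    heads = BoundaryHeads(dim=256, num_classes=20)
    cls_logits, start_off, end_off = heads(layer(feats))
    print(cls_logits.shape, start_off.shape, end_off.shape)
```

In this sketch, the dilation argument stands in for the dilated convolutions the abstract attributes to deeper IADCM layers, and the two 1x1 heads stand in for the separate start/end regression heads; how the actual IACFormer composes these pieces is specified only in the full paper.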
External IDs: dblp:journals/apin/ZhangXLWYGZ25