Abstract: Temporal Action Localization (TAL) aims to classify and localize all actions within untrimmed videos. Existing TAL methods often predict inaccurate boundaries because adjacent frames contain similar action content and the boundaries between them are inherently uncertain. Many of these methods rely on fixed or global proposal learning strategies and lack a mechanism for fine-grained boundary refinement. In this paper, we propose BRTAL, a new Boundary Refinement framework for TAL based on an offset-driven diffusion model, designed to improve action boundary precision through iterative refinement. Unlike traditional TAL methods, which emphasize global target predictions, BRTAL adopts a local refinement perspective by leveraging an offset-driven strategy. Specifically, our framework employs diffusion to iteratively generate local offsets between predictions and ground truth, gradually reducing these offsets so that the predictions align more closely with the ground truth. This approach is particularly effective for the ambiguous boundaries frequently encountered in TAL, enabling BRTAL to localize boundaries more precisely than existing methods. Furthermore, we introduce a lightweight yet powerful Temporal Context Modeling (TCM) module to enhance temporal information modeling for accurate action localization. TCM features a Temporal Representation Perception (TRP) layer, which captures temporal evolution and long-term contextual dependencies through a squeeze-and-excitation design combined with large convolutional kernels, ensuring robust temporal representation learning. Extensive experiments on the THUMOS14, ActivityNet-1.3, and EPIC-KITCHENS-100 datasets highlight the significant advantages of BRTAL. Notably, BRTAL achieves an average mAP of 69.6% on THUMOS14, setting a new state of the art and demonstrating its boundary refinement capability.
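The offset-driven idea described in the abstract can be illustrated with a small toy sketch. This is not the BRTAL implementation: the function names (`refine_boundaries`, `toy_offset_predictor`, `tiou`), the step count, and the `rate` parameter are all hypothetical, and the learned diffusion denoiser is replaced by a stand-in that simply returns a fraction of the true offset. The stochastic noise schedule of a real diffusion model is omitted. The sketch only shows how repeatedly applying small predicted offsets moves a coarse segment toward the ground truth.

```python
def refine_boundaries(segment, predict_offset, num_steps=4):
    """Iteratively shift a (start, end) segment by predicted local offsets.

    `predict_offset` is a hypothetical callable standing in for a learned
    offset predictor; it returns (delta_start, delta_end) for a segment.
    """
    start, end = segment
    for _ in range(num_steps):
        d_start, d_end = predict_offset(start, end)
        start += d_start
        end += d_end
    return start, end


def toy_offset_predictor(gt_segment, rate=0.5):
    """Stand-in predictor (assumption): returns a fixed fraction of the
    true offset. A real model would estimate this from video features."""
    gt_start, gt_end = gt_segment

    def predict(start, end):
        return rate * (gt_start - start), rate * (gt_end - end)

    return predict


def tiou(a, b):
    """Temporal intersection-over-union of two (start, end) segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


ground_truth = (10.0, 20.0)
coarse = (8.0, 23.0)  # a coarse prediction with inaccurate boundaries
refined = refine_boundaries(coarse, toy_offset_predictor(ground_truth))
print("coarse tIoU: ", round(tiou(coarse, ground_truth), 3))
print("refined tIoU:", round(tiou(refined, ground_truth), 3))
```

With this stand-in predictor the remaining offset halves at every step, so the refined segment overlaps the ground truth more than the coarse one does. In an actual diffusion-based model the offsets would be produced by a trained denoising network conditioned on temporal features.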
External IDs: dblp:journals/tcsv/LiuLFX25