Abstract: Temporal action localization (TAL) is a critical task in video understanding. Effectively utilizing multi-scale information and handling interactions across various scales have consistently posed challenging issues within the realm of TAL. In this article, we propose a novel gated multi-scale Transformer model (TransGMC) for temporal action localization. A gated control mechanism is designed to filter and aggregate the information at different scales, by which the contributions of contexts at different temporal scales are well characterized. To enhance the feature representation at each temporal scale, the rich global-local contexts are extracted at each temporal scale. A cascade attention module that contains two seamlessly integrated channel attention and moment attention is proposed for capturing global temporal contexts. We utilize a new regression loss function for locating the time boundaries. We conducted experiments on four challenging benchmark datasets, including two third-person view datasets and two first-person view datasets. Our method achieves an average mAP of 67.5% on THUMOS14, 36.1% on ActivityNet v1.3, 24.9% on EPIC-Kitchens 100, and 23.2% on Ego4D, which all outperform the previous state-of-the-arts methods. Extensive ablation studies also validate the effectiveness of the proposed method.
Loading