Temporal Action Localization with Global Segmentation Mask Transformers

Sep 29, 2021 (edited Oct 06, 2021) | ICLR 2022 Conference Withdrawn Submission | Readers: Everyone
  • Keywords: Temporal Action Localization, Transformer, Global Contextual Learning, Self-attention Learning
  • Abstract: The promising results of Transformers for object detection in images motivate Transformer-based methods for temporal action localization (TAL) in videos. However, adapting recent object detection Transformers is non-trivial because of two challenges unique to TAL: (1) more complex spatio-temporal visual observations, and (2) less available training data. To address these two challenges, this paper proposes a novel Global Segmentation Mask Transformer (GSMT). Compared to object detection Transformers, it is architecturally reformulated around one core idea: driving the Transformer to learn global segmentation masks of all action instances jointly over the full video length. Supervised by such global temporal structure signals, GSMT can be trained more effectively from limited, complex video data. Because it models TAL holistically rather than locally for each individual proposal, our model also differs significantly from conventional proposal-based TAL methods, which learn to detect the local start and end points of action instances using more complex architectures. Extensive experiments show that, despite its simpler design, GSMT outperforms existing TAL methods, achieving new state-of-the-art performance on two benchmarks. Importantly, it is around 100× faster to train and twice as efficient at inference.