Do We Really Need Temporal Convolutions in Action Segmentation?

Published: 01 Jan 2023, Last Modified: 21 Sept 2023, ICME 2023
Abstract: Recognizing and segmenting actions in long videos is a challenging problem. Most existing methods focus on designing temporal convolutional models, but these models are limited in their flexibility and in their ability to capture long-term dependencies. Transformers have recently been applied to a wide range of tasks; however, their lack of inductive bias and their inefficiency on long video sequences limit their application to action segmentation. In this paper, we present a pure Transformer-based model for action segmentation that uses no temporal convolutions, called the Temporal U-Transformer. The U-Transformer architecture not only reduces complexity but also introduces the inductive bias that neighboring frames are likely to belong to the same class. In addition, we propose a boundary-aware loss, based on the distribution of similarity scores between frames from the attention modules, to improve the model's ability to recognize action boundaries. Extensive experiments show the effectiveness of our method.
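The intuition behind using inter-frame similarity for boundary recognition can be sketched in a few lines. The snippet below is an illustrative reconstruction, not the paper's actual loss: it scores each transition between consecutive frame embeddings by their cosine dissimilarity, so that transitions between dissimilar frames (likely action boundaries) score high. The function name and the use of plain cosine similarity are assumptions for illustration.

```python
import numpy as np

def boundary_scores(frames: np.ndarray) -> np.ndarray:
    """Illustrative sketch (not the paper's exact loss): score each
    transition between consecutive frame embeddings by cosine
    dissimilarity. `frames` has shape (T, D); returns (T-1,) scores
    in [0, 1], where high values suggest an action boundary."""
    # Normalize each frame embedding to unit length (epsilon avoids /0).
    norm = frames / (np.linalg.norm(frames, axis=1, keepdims=True) + 1e-8)
    # Cosine similarity between frame t and frame t+1.
    sim = np.sum(norm[:-1] * norm[1:], axis=1)
    # Map similarity in [-1, 1] to a boundary score in [0, 1]:
    # dissimilar neighboring frames score high.
    return (1.0 - sim) / 2.0
```

In a training setting, such per-transition scores could be compared against annotated boundaries to form a loss term, encouraging the attention modules to produce sharp similarity drops at true boundaries.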