Do We Really Need Temporal Convolutions in Action Segmentation?

Published: 01 Jan 2023, Last Modified: 21 Sept 2023, ICME 2023
Abstract: Recognizing and segmenting actions in long videos is a challenging problem. Most existing methods focus on designing temporal convolutional models, but these models are limited in their flexibility and in their ability to capture long-term dependencies. Transformers have recently been applied to a wide range of tasks; however, their lack of inductive bias and their inefficiency on long video sequences limit their application to action segmentation. In this paper, we present a pure Transformer-based model for action segmentation that uses no temporal convolutions, called the Temporal U-Transformer. The U-Transformer architecture not only reduces complexity but also introduces the inductive bias that neighboring frames are likely to belong to the same class. In addition, we propose a boundary-aware loss, based on the distribution of similarity scores between frames from the attention modules, to improve the model's ability to recognize action boundaries. Extensive experiments show the effectiveness of our method.
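The intuition behind using inter-frame similarity for boundary recognition can be sketched in a few lines. The snippet below is an illustrative reconstruction, not the paper's actual loss: it scores each transition between consecutive frame embeddings by their cosine dissimilarity, so that transitions between dissimilar frames (likely action boundaries) score high. The function name and the use of plain cosine similarity are assumptions for illustration.

```python
import numpy as np

def boundary_scores(frames: np.ndarray) -> np.ndarray:
    """Illustrative sketch (not the paper's exact loss): score each
    transition between consecutive frame embeddings by cosine
    dissimilarity. `frames` has shape (T, D); returns (T-1,) scores
    in [0, 1], where high values suggest an action boundary."""
    # Normalize each frame embedding to unit length (epsilon avoids /0).
    norm = frames / (np.linalg.norm(frames, axis=1, keepdims=True) + 1e-8)
    # Cosine similarity between frame t and frame t+1.
    sim = np.sum(norm[:-1] * norm[1:], axis=1)
    # Map similarity in [-1, 1] to a boundary score in [0, 1]:
    # dissimilar neighboring frames score high.
    return (1.0 - sim) / 2.0
```

In a training setting, such per-transition scores could be compared against annotated boundaries to form a loss term, encouraging the attention modules to produce sharp similarity drops at true boundaries.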