Abstract: In action segmentation, the goal is to partition a long, untrimmed video into a sequence of action segments. Recently, Transformer-based methods have surpassed the previously dominant temporal convolutional networks (TCNs) in overall performance. However, both TCNs and Transformers suffer from over-segmentation. Prior approaches often relied on post-processing techniques to mitigate this issue, but such techniques are not applicable to every model and can even degrade performance. In this paper, we therefore propose a set of loss functions that enhance representation learning and adopt a multi-task learning approach that strengthens the model's ability to identify action boundaries. Through extensive experiments, we show that our method yields significant improvements, particularly in reducing over-segmentation.