Abstract: Segmenting activities in untrimmed videos remains a critical challenge for fully understanding complex human activity sequences. Correctly representing the temporal relations between actions is key to correcting erroneous segmentations. We propose a self-attention-based model that refines initial segmentations by separately considering intra-segment and inter-segment relations between predicted action segments. Furthermore, to enhance the training process, we use a similarity-guided regularization technique that enforces intra-segment similarity and the validity of action transitions between adjacent segments. In an extensive evaluation on three public datasets (Georgia Tech Egocentric Activities (GTEA), 50Salads, and Breakfast), our proposed architecture improves over the backbone model by 6.1% on GTEA, 3.8% on 50Salads, and 3.9% on Breakfast with regard to the F1@50 metric.
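The abstract reports results in F1@50 without defining it. The sketch below illustrates the segmental F1@k score as it is commonly computed in the action segmentation literature: a predicted segment counts as a true positive if its intersection-over-union with a not-yet-matched ground-truth segment of the same class is at least k% (here 50%). It is an assumption that this work uses exactly this variant. The function names `to_segments` and `f1_at_k` and the per-frame label input format are hypothetical and chosen for illustration only.

```python
from typing import List, Tuple

Segment = Tuple[str, int, int]  # (label, start, end), end exclusive


def to_segments(frame_labels: List[str]) -> List[Segment]:
    """Collapse per-frame labels into contiguous (label, start, end) segments."""
    segments: List[Segment] = []
    start = 0
    for i in range(1, len(frame_labels) + 1):
        # Close the current segment at the end of the sequence or on a label change.
        if i == len(frame_labels) or frame_labels[i] != frame_labels[start]:
            segments.append((frame_labels[start], start, i))
            start = i
    return segments


def f1_at_k(pred: List[str], gt: List[str], overlap: float = 0.5) -> float:
    """Segmental F1 at an IoU threshold (assumed definition, not taken from the paper)."""
    pred_segs = to_segments(pred)
    gt_segs = to_segments(gt)
    matched = [False] * len(gt_segs)
    tp = fp = 0

    for label, p_start, p_end in pred_segs:
        best_iou, best_j = 0.0, -1
        for j, (g_label, g_start, g_end) in enumerate(gt_segs):
            if g_label != label:
                continue
            inter = max(0, min(p_end, g_end) - max(p_start, g_start))
            union = max(p_end, g_end) - min(p_start, g_start)
            iou = inter / union
            if iou > best_iou:
                best_iou, best_j = iou, j
        # Each ground-truth segment may be matched by at most one prediction.
        if best_j >= 0 and best_iou >= overlap and not matched[best_j]:
            tp += 1
            matched[best_j] = True
        else:
            fp += 1

    fn = matched.count(False)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    gt = ["a"] * 8
    over_segmented = ["a"] * 3 + ["b"] * 2 + ["a"] * 3
    print(f1_at_k(over_segmented, gt))
```

Because the score is computed over segments rather than frames, an over-segmented prediction is penalized heavily even when most frames are labeled correctly, which is why the metric is a natural target for a refinement stage.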