Abstract: Current timestamp-supervised temporal action segmentation (TS-TAS) methods typically follow a two-phase pipeline: initializing the model with timestamp labels, then refining it with pseudo-labels. The sparsity of timestamp annotations, however, leaves their performance sub-optimal in two ways. First, initializing the model with only timestamp annotations may cause overfitting to the labeled frames. Second, sparse timestamp annotations cannot capture the diverse action representations across a whole action instance, especially near ambiguous action boundaries, which leads to pseudo-label noise. Inspired by the cluster assumption of semi-supervised learning (SSL), that points within the same manifold likely share the same label, we model TS-TAS as an SSL problem. Specifically, we propose a Temporal Embedding Consistency (TEC) strategy to mitigate the excessive focus on annotated frames. TEC encourages frames with similar representations within a video to have similar classification probability distributions, thereby propagating information from labeled frames to unlabeled ones. In addition, we design a TS-Mix strategy that further leverages unlabeled data to mitigate the influence of pseudo-label noise through consistency regularization. TS-Mix includes intra-mix, which adds a linear interpolation of two adjacent timestamps to every frame between them, and inter-mix, which mixes frames from two different untrimmed videos frame-by-frame. The model is then trained on the mixed video with the correspondingly mixed pseudo-labels. Comprehensive experiments on different benchmarks show that the method achieves new state-of-the-art performance. Furthermore, the proposed method can be seamlessly integrated into existing methods, significantly improving their performance.
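As a rough illustration of the inter-mix idea described above, the following is a minimal mixup-style sketch, not the authors' implementation. The function name `inter_mix`, the mixing weight `lam`, the use of per-frame feature vectors and soft pseudo-label vectors, and the truncation of both videos to the shorter length are all assumptions made for this example; the abstract does not specify them.

```python
from typing import List, Tuple

Frames = List[List[float]]  # T frames, each a D-dimensional feature vector
Labels = List[List[float]]  # T frames, each a C-class probability vector


def inter_mix(
    feats_a: Frames,
    labels_a: Labels,
    feats_b: Frames,
    labels_b: Labels,
    lam: float,
) -> Tuple[Frames, Labels]:
    """Mix two videos frame-by-frame and mix their pseudo-labels the same way.

    Hypothetical helper: `lam` weights video A and (1 - lam) weights video B.
    Videos of different lengths are truncated to the shorter one, which is
    an assumption of this sketch rather than something the source states.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")

    num_frames = min(len(feats_a), len(feats_b))

    # Convex combination of the two feature vectors at each frame index.
    mixed_feats = [
        [lam * x + (1.0 - lam) * y for x, y in zip(feats_a[t], feats_b[t])]
        for t in range(num_frames)
    ]
    # The pseudo-labels are mixed with the same weight, so each mixed frame
    # is trained against a correspondingly mixed target distribution.
    mixed_labels = [
        [lam * p + (1.0 - lam) * q for p, q in zip(labels_a[t], labels_b[t])]
        for t in range(num_frames)
    ]
    return mixed_feats, mixed_labels
```

Because both the frames and the targets are convex combinations with the same weight, each mixed pseudo-label remains a valid probability distribution whenever the two inputs are.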
External IDs:dblp:journals/tmm/RenLCWWG25