Abstract: Timestamp-supervised action segmentation aims to segment and classify actions in untrimmed videos when only one randomly selected frame is annotated per action. Precisely localizing action boundaries from these timestamp annotations is crucial in this setting, because it allows framewise pseudo-labels to be generated and well-explored fully-supervised training to be applied. However, prevailing methods struggle with the intrinsic uncertainty of boundary localization, which stems from less discriminative features in action-transition regions. Imprecise boundary estimation reduces the stability and reliability of the pseudo-labels generated in these ambiguous regions and consequently degrades the performance of the trained segmentation models. In this paper, we introduce the boundary voting network, which mitigates feature ambiguity by hierarchically propagating video-level global prior knowledge into local action-transition regions. Key action representations are generated as votes throughout the video and directed at action-transition regions, where all votes collaboratively contribute to feature enhancement and boundary localization refinement. Extensive experiments on the GTEA, 50Salads, and Breakfast datasets demonstrate the effectiveness of our method.
External IDs: doi:10.1109/tcsvt.2025.3571770