Abstract: The goal of unsupervised action segmentation (UAS) is to classify video frames into predefined action classes, which can be framed as a clustering or boundary detection problem. Previous methods based on bottom-up agglomerative hierarchical clustering suffer from over-segmentation or under-segmentation. To address these problems, we propose Two-step Temporal Divisive Clustering (TTDC). The first step of TTDC is top-down Temporal Divisive Clustering (TDC), which captures global context by comparing the intra-class variances of different classes and captures local context through boundary detection. The second step is the Self-supervised Soft Boundary Regression Network (SS-BRN), which is trained on soft pseudo-labels produced by TDC to refine the cluster boundaries. Using soft rather than hard pseudo-labels in the loss function alleviates the issue of low-confidence pseudo-labels. Our empirical evaluations on three benchmarks (50Salads, Breakfast, and MPII Cooking 2) demonstrate that TTDC outperforms state-of-the-art methods.
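To make the top-down idea concrete, here is a minimal sketch of a generic variance-based temporal divisive clustering. It is not the authors' TDC: the abstract does not specify the splitting criterion, so this sketch assumes that each step places the single cut that most reduces the within-segment sum of squared deviations. The function names (`sse`, `best_split`, `temporal_divisive_clustering`) and the `num_segments` stopping rule are hypothetical and chosen for illustration only.

```python
import numpy as np


def sse(x):
    """Sum of squared deviations from the segment mean (within-segment scatter)."""
    return float(((x - x.mean(axis=0)) ** 2).sum())


def best_split(x):
    """Return (gain, t) for the cut x[:t] | x[t:] that most reduces scatter.

    t is None when no cut gives a strictly positive reduction.
    """
    total = sse(x)
    best_gain, best_t = 0.0, None
    for t in range(1, len(x)):
        gain = total - sse(x[:t]) - sse(x[t:])
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_gain, best_t


def temporal_divisive_clustering(features, num_segments):
    """Split a (T, D) frame-feature sequence top-down into contiguous segments.

    Assumption (not from the paper): at every step, the cut with the largest
    scatter reduction across all current segments is applied.
    Returns the sorted list of segment boundaries, including 0 and T.
    """
    bounds = [0, len(features)]
    while len(bounds) - 1 < num_segments:
        candidates = []
        for start, end in zip(bounds[:-1], bounds[1:]):
            gain, t = best_split(features[start:end])
            if t is not None:
                candidates.append((gain, start + t))
        if not candidates:  # every remaining segment is already homogeneous
            break
        _, cut = max(candidates)
        bounds = sorted(bounds + [cut])
    return bounds


# Toy sequence: three constant "actions" lasting 5, 7, and 4 frames.
toy = np.concatenate([
    np.zeros((5, 2)),
    np.full((7, 2), 5.0),
    np.full((4, 2), -3.0),
])
boundaries = temporal_divisive_clustering(toy, num_segments=3)
```

Because every cut is chosen by looking at a whole segment, the first splits reflect global structure. This contrasts with bottom-up agglomerative merging, which begins with local frame-level decisions. A boundary-refinement stage such as SS-BRN would then operate on the boundaries this kind of procedure produces.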