Abstract: Human action segmentation in video analysis for HCI (human-computer interaction) applications has been extensively studied to determine the category and start time of the actions that occur in a video. However, it remains an unsolved problem, largely because accurately annotated data (begin frame, end frame, and action category) are scarce in video-analysis applications. To address this issue, weakly supervised action segmentation based on transcripts uses only an ordered list of the actions occurring in a long video instead of per-frame labels, which significantly reduces the difficulty of obtaining finely labeled video datasets. However, the task remains challenging because the temporal lengths of actions within a video are hard to partition. In this paper, we use the Viterbi algorithm to generate an initial, coarse action segmentation as the baseline and then design a coarse-to-fine learning framework to refine the length partition. By connecting the candidate frames around the initial segmentation points in order and constructing a fully connected directed graph, we design a new coarse-to-fine loss function that learns the scores of valid and invalid segmentation paths, respectively. The framework is trained end-to-end with this loss to down-weight the scores of invalid segmentation paths and obtain the best video segmentation. Experiments on the Breakfast and 50 Salads datasets show that, compared with state-of-the-art methods, our fine-partition model and coarse-to-fine loss function achieve higher frame accuracy and significantly reduce the time required for human action segmentation in HCI videos. The source code will be made publicly available (https://github.com/WeaklyActionSegmentation).
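The coarse baseline mentioned above aligns a known action transcript to frame-wise class scores with the Viterbi algorithm. A minimal sketch of such a transcript-constrained Viterbi alignment is shown below; this is a simplified, hypothetical variant (the actual model in the paper may also incorporate length priors), with `frame_scores` assumed to be per-frame log-probabilities over action classes:

```python
import numpy as np

def viterbi_align(frame_scores, transcript):
    """Align an ordered transcript of action labels to frames by
    maximizing the summed per-frame log-scores (simplified sketch).

    frame_scores: (T, C) array of log-probabilities per frame and class.
    transcript:   ordered list of class indices that must appear in order.
    Returns a length-T array of per-frame class labels.
    """
    T, N = len(frame_scores), len(transcript)
    # dp[t, n]: best score if frame t is assigned to transcript entry n.
    dp = np.full((T, N), -np.inf)
    back = np.zeros((T, N), dtype=int)  # 0 = stayed in entry n, 1 = advanced from n-1
    dp[0, 0] = frame_scores[0, transcript[0]]
    for t in range(1, T):
        for n in range(N):
            stay = dp[t - 1, n]
            adv = dp[t - 1, n - 1] if n > 0 else -np.inf
            best_prev = max(stay, adv)
            dp[t, n] = best_prev + frame_scores[t, transcript[n]]
            back[t, n] = 0 if stay >= adv else 1
    # Backtrace: the last frame must consume the last transcript entry.
    labels = np.empty(T, dtype=int)
    n = N - 1
    for t in range(T - 1, -1, -1):
        labels[t] = transcript[n]
        n -= back[t, n]
    return labels
```

For example, with scores favoring class 0 in the first two frames and class 1 in the last two, aligning the transcript `[0, 1]` places the segment boundary between frames 1 and 2.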