Abstract: Training a temporal action segmentation (TAS) model on
long and untrimmed videos requires gathering framewise video annotations, which is very costly. We propose a two-stage active learning framework to efficiently learn a TAS model using only a small number of video annotations. Our framework consists of three components that work together in each active learning iteration. 1) Using the currently labeled frames,
we learn a TAS model and action prototypes using a novel contrastive
learning method. Leveraging prototypes not only enhances model performance but also increases the computational efficiency of both
video and frame selection for labeling, which are the next components
of our framework. 2) Using the currently learned TAS model and action
prototypes, we select informative unlabeled videos for annotation. To do
so, we find unlabeled videos that have low alignment scores to learned
action prototype sequences in labeled videos. 3) To annotate a small
subset of informative frames in each selected unlabeled video, we propose a video-aligned summary selection method and an efficient greedy
search algorithm. Through evaluation on four benchmark datasets (50Salads, GTEA, Breakfast, CrossTask), we show that our method significantly reduces annotation cost while consistently surpassing baselines across active learning iterations. Our method achieves performance comparable to or better than that of other weakly supervised methods while using only a small number of labeled frames. We further extend our framework to a semi-supervised active learning setting. To the best of our knowledge, this is the first work to study active learning for TAS.
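To make the second component more concrete, the following is a minimal sketch of prototype-based video selection, assuming per-frame features for unlabeled videos and ordered action-prototype sequences derived from labeled videos. It illustrates the general idea rather than the paper's exact formulation; the helper names (`alignment_score`, `select_videos`) and the simple DTW-style recursion are assumptions.

```python
# Illustrative sketch (not the paper's exact method): score an unlabeled video
# by aligning its frame features to a prototype sequence from a labeled video,
# then pick the videos with the lowest best-case alignment score.
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def alignment_score(frame_features, prototype_sequence):
    """DTW-style monotonic alignment of T frame features (T x D) to an ordered
    sequence of K action prototypes (K x D); assumes T >= K. Returns the
    average per-frame similarity along the best alignment path."""
    T, K = len(frame_features), len(prototype_sequence)
    sim = np.array([[cosine_sim(f, p) for p in prototype_sequence]
                    for f in frame_features])          # T x K similarities
    dp = np.full((T, K), -np.inf)
    dp[0, 0] = sim[0, 0]                                # first frame -> first prototype
    for t in range(1, T):
        for k in range(K):
            # Either stay on the same prototype or advance to the next one.
            best_prev = dp[t - 1, k] if k == 0 else max(dp[t - 1, k], dp[t - 1, k - 1])
            dp[t, k] = sim[t, k] + best_prev
    return dp[T - 1, K - 1] / T                         # length-normalized score

def select_videos(unlabeled, prototype_sequences, budget):
    """Pick `budget` unlabeled videos whose best alignment to any labeled
    prototype sequence is lowest, i.e., the least explained by the model."""
    scores = {vid: max(alignment_score(feats, seq) for seq in prototype_sequences)
              for vid, feats in unlabeled.items()}
    return sorted(scores, key=scores.get)[:budget]
```

In this sketch, videos whose best achievable alignment to any labeled prototype sequence is low are treated as poorly explained by the current model and prototypes, and are therefore prioritized for annotation.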