Weakly-Supervised Action Segmentation and Alignment via Transcript-Aware Union-of-Subspaces Learning
Abstract: We address the problem of learning to segment actions
from weakly-annotated videos, i.e., videos accompanied by
transcripts (ordered lists of actions). We propose a framework in which we model actions with a union of low-dimensional subspaces, learn the subspaces using transcripts, and refine video features so that they lend themselves to action subspaces. To do so, we design an architecture consisting of a Union-of-Subspaces Network, which is an ensemble
of autoencoders, each modeling a low-dimensional action
subspace and capturing variations of an action within
and across videos. For learning, at each iteration, we generate positive and negative soft alignment matrices using
the segmentations from the previous iteration, which we use
for discriminative training of our model. To regularize the
learning, we introduce a constraint loss that prevents imbalanced segmentations and enforces relatively similar durations of each action across videos. To achieve real-time inference, we develop a hierarchical segmentation framework
that uses subset selection to find representative transcripts
and hierarchically aligns a test video with increasingly refined representative transcripts. Our experiments on three
datasets show that our method improves on the state of the art in
action segmentation and alignment, while speeding up
inference by a factor of 4 to 13.
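
As a rough illustration of the union-of-subspaces idea, the sketch below fits one low-dimensional linear subspace per action and assigns each frame to the action whose subspace reconstructs it with the smallest error. It is only a toy analogue under stated assumptions: PCA stands in for the autoencoders of the Union-of-Subspaces Network, the function names (`fit_subspace`, `reconstruction_error`, `assign_frames`) and the synthetic 3-D features are hypothetical, and the transcript-based alignment, constraint loss, and hierarchical inference are not modeled.

```python
import numpy as np

def fit_subspace(frames, dim):
    """Fit a dim-dimensional linear subspace to frames of shape (n, d) via PCA.

    Assumption: a linear subspace stands in for one autoencoder of the ensemble.
    """
    mean = frames.mean(axis=0)
    # Rows of vt are principal directions, ordered by singular value.
    _, _, vt = np.linalg.svd(frames - mean, full_matrices=False)
    return mean, vt[:dim]

def reconstruction_error(frames, subspace):
    """Squared distance of each frame to the given subspace."""
    mean, basis = subspace
    centered = frames - mean
    projected = centered @ basis.T @ basis
    return np.sum((centered - projected) ** 2, axis=1)

def assign_frames(frames, subspaces):
    """Label each frame with the index of its best-reconstructing subspace."""
    errors = np.stack(
        [reconstruction_error(frames, s) for s in subspaces], axis=1
    )
    return errors.argmin(axis=1)

# Toy data (hypothetical): two actions lying on different lines in 3-D.
t = np.linspace(-1.0, 1.0, 11)
action_a = np.outer(t, [1.0, 0.0, 0.0])
action_b = np.outer(t, [0.0, 1.0, 0.0])
subspaces = [fit_subspace(action_a, dim=1), fit_subspace(action_b, dim=1)]

test_frames = np.array([[2.0, 0.0, 0.0], [0.0, 3.0, 0.0]])
labels = assign_frames(test_frames, subspaces)
```

In this toy setting the two test frames are assigned to actions 0 and 1, respectively. A frame-wise argmin ignores the ordering given by a transcript, which is exactly what the alignment step described above adds.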