Learning to Segment Actions from Visual and Language Instructions via Differentiable Weak Sequence Alignment
Abstract: We address the problem of unsupervised localization
of task-relevant actions (key-steps) and feature learning
in instructional videos using both visual and language
instructions. Our key observation is that the sequences of
visual and linguistic key-steps are weakly aligned: there
is an ordered one-to-one correspondence between most
visual and language key-steps, while some key-steps in
one modality are absent in the other. To recover the
two sequences, we develop an ordered prototype learning
module, which extracts visual and linguistic prototypes
representing key-steps. To find weak alignment and perform
feature learning, we develop a differentiable weak sequence
alignment (DWSA) method that finds ordered one-to-one
matching between sequences while allowing some items in
a sequence to stay unmatched. We develop an efficient forward and backward algorithm for computing the alignment
and the loss derivative with respect to parameters of visual
and language feature learning modules. In experiments on
two instructional video datasets, we show that our method
significantly improves over the state of the art.
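The weak alignment described above can be pictured as a smoothed edit-distance-style dynamic program: matched pairs incur a feature-distance cost, unmatched items incur a gap penalty, and a soft-min makes the recursion differentiable so gradients can flow back to the feature encoders. The sketch below is a minimal illustration under those assumptions, not the paper's exact DWSA formulation; the `gap` penalty, the `gamma` smoothing temperature, and the Euclidean match cost are all illustrative choices.

```python
import numpy as np

def softmin(vals, gamma=0.1):
    """Smooth, differentiable stand-in for min():
    -gamma * log(sum(exp(-v / gamma))), computed stably."""
    vals = np.asarray(vals, dtype=float)
    m = vals.min()
    return m - gamma * np.log(np.exp(-(vals - m) / gamma).sum())

def weak_alignment_cost(X, Y, gap=1.0, gamma=0.1):
    """Ordered one-to-one alignment cost between rows of X and Y,
    where any item may stay unmatched at a fixed `gap` penalty.
    D[i, j] = smoothed cost of aligning X[:i] with Y[:j]."""
    n, m = len(X), len(Y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            cands = []
            if i > 0 and j > 0:
                # match X[i-1] with Y[j-1] (illustrative Euclidean cost)
                cands.append(D[i - 1, j - 1] + np.linalg.norm(X[i - 1] - Y[j - 1]))
            if i > 0:
                cands.append(D[i - 1, j] + gap)  # leave X[i-1] unmatched
            if j > 0:
                cands.append(D[i, j - 1] + gap)  # leave Y[j-1] unmatched
            D[i, j] = softmin(cands, gamma)
    return D[n, m]
```

Because every step is a soft-min over smooth costs, the final value is differentiable with respect to the features, which is what allows an alignment loss of this shape to train the visual and language encoders end to end.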