StepFormer: Self-supervised Step Discovery and Localization in Instructional Videos
Abstract: Instructional videos are an important resource to learn
procedural tasks from human demonstrations. However,
the instruction steps in such videos are typically short and
sparse, with most of the video being irrelevant to the procedure. This motivates the need to temporally localize the
instruction steps in such videos, i.e., the task of key-step localization. Traditional methods for key-step localization require video-level human annotations and thus do
not scale to large datasets. In this work, we tackle the problem with no human supervision and introduce StepFormer, a
self-supervised model that discovers and localizes instruction steps in a video. StepFormer is a transformer decoder
that attends to the video with learnable queries, and produces a sequence of slots capturing the key-steps in the
video. We train our system on a large dataset of instructional videos, using their automatically generated subtitles
as the only source of supervision. In particular, we supervise our system with a sequence of text narrations using an
order-aware loss function that filters out irrelevant phrases.
We show that our model outperforms all previous unsupervised and weakly-supervised approaches on step detection
and localization by a large margin on three challenging
benchmarks. Moreover, our model exhibits an emergent ability to perform zero-shot multi-step localization and outperforms all relevant baselines on this task.
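To make the architecture described above concrete, the sketch below shows the basic pattern: a set of learnable step queries cross-attends to precomputed video features through a transformer decoder and yields one ordered slot per query. This is a minimal sketch; the number of queries, feature dimension, and layer counts are illustrative placeholders, not the paper's configuration.

```python
import torch
import torch.nn as nn

class StepFormerSketch(nn.Module):
    """Minimal sketch: learnable step queries attend to video features
    via a transformer decoder and produce one step slot per query."""

    def __init__(self, num_queries=32, dim=512, num_layers=6, num_heads=8):
        super().__init__()
        # K learnable query vectors, one per candidate step slot.
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=num_heads,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)

    def forward(self, video_feats):
        # video_feats: (B, T, dim) precomputed frame/clip embeddings.
        B = video_feats.size(0)
        queries = self.queries.unsqueeze(0).expand(B, -1, -1)  # (B, K, dim)
        # Queries cross-attend to the video; output is an ordered set of slots.
        slots = self.decoder(tgt=queries, memory=video_feats)   # (B, K, dim)
        return slots

# Usage: embed a video into T clip features, then extract step slots.
video = torch.randn(2, 200, 512)    # batch of 2 videos, 200 clips each
slots = StepFormerSketch()(video)   # (2, 32, 512) step-slot embeddings
```

The slots can then be compared against narration or step-description embeddings to detect and localize key-steps.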
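The order-aware supervision mentioned in the abstract can be pictured as a monotonic alignment between step slots and narration phrases that is allowed to drop phrases. The dynamic program below only illustrates that idea; the function name, drop cost, and recursion are assumptions for exposition, not the paper's actual loss.

```python
import numpy as np

def order_aware_alignment_score(sim, drop_cost=-0.1):
    """Illustrative order-preserving matching between K step slots and
    N narration phrases (rows/columns of `sim`): phrases may be dropped
    at `drop_cost`, and slots may be left unmatched.

    Plain max-score dynamic program, not the paper's training objective.
    """
    K, N = sim.shape
    D = np.zeros((K + 1, N + 1))
    # Dropping the first j phrases before any slot is matched.
    for j in range(1, N + 1):
        D[0, j] = D[0, j - 1] + drop_cost
    for i in range(1, K + 1):
        D[i, 0] = D[i - 1, 0]  # leaving slot i unmatched costs nothing here
        for j in range(1, N + 1):
            match = D[i - 1, j - 1] + sim[i - 1, j - 1]  # slot i <-> phrase j
            drop = D[i, j - 1] + drop_cost               # phrase j is irrelevant
            skip = D[i - 1, j]                           # slot i stays unmatched
            D[i, j] = max(match, drop, skip)
    return D[K, N]

# Usage: higher scores mean the slots explain the narration in order.
sim = np.random.rand(8, 20)  # similarities of 8 step slots vs. 20 phrases
print(order_aware_alignment_score(sim))
```

Negating such a score (or a differentiable relaxation of it) would yield a training signal that respects narration order while tolerating irrelevant phrases, which is the behavior the abstract describes.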