Abstract: In this paper, we tackle the problem of self-supervised video alignment and activity progress prediction using in-the-wild videos. Our proposed self-supervised representation learning method carefully addresses different action orderings, redundant actions, and background frames to generate improved video representations compared to previous methods. Our model generalizes temporal cycle-consistency learning to allow for more flexibility in determining cycle-consistent neighbors. More specifically,
to handle repeated actions, we propose a multi-neighbor
cycle consistency and a multi-cycle-back regression loss
by finding multiple soft nearest neighbors using a Gaussian Mixture Model. To handle background and redundant
frames, we introduce a context-dependent drop function in
our framework, discouraging the alignment of droppable
frames. Finally, to learn from videos of multiple activities jointly, we propose a multi-head cross-task network, allowing us to embed a video and estimate progress
without knowing its activity label. Experiments on multiple
datasets show that our method outperforms the state-of-the-art for video alignment and progress prediction.
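As a rough illustration of the multi-neighbor idea, the sketch below computes several soft nearest neighbors for a single query frame by weighting the frames of the other video with both a softmax similarity and the responsibilities of a fitted Gaussian Mixture Model. This is a minimal sketch under assumed details (embedding dimension, number of components `k`, a temperature on the similarities, and the use of scikit-learn's `GaussianMixture`), not the paper's exact formulation or training loss.

```python
# Minimal sketch (illustrative assumptions, not the paper's exact method):
# compute k soft nearest neighbors of one query frame in another video.
import numpy as np
from sklearn.mixture import GaussianMixture

def multi_soft_nearest_neighbors(u_i, V, k=2, temperature=0.1):
    """u_i: (d,) query frame embedding from video U.
    V:   (T, d) frame embeddings of video V.
    Returns a (k, d) array: one soft nearest neighbor per mixture component."""
    # Stabilized softmax over negative squared distances, as in
    # soft-nearest-neighbor-based cycle-consistency losses.
    dists = np.sum((V - u_i) ** 2, axis=1)
    alpha = np.exp(-(dists - dists.min()) / temperature)
    alpha /= alpha.sum()

    # Fit a GMM over V's frame embeddings; the responsibilities split the
    # frames into k groups, so repeated occurrences of the same action can
    # each contribute their own neighbor.
    gmm = GaussianMixture(n_components=k, covariance_type="diag", random_state=0).fit(V)
    resp = gmm.predict_proba(V)  # (T, k)

    # One similarity-weighted average per component -> k soft nearest neighbors.
    w = alpha[:, None] * resp
    w /= w.sum(axis=0, keepdims=True) + 1e-8
    return w.T @ V  # (k, d)

# Example usage with random embeddings (d=128, T=40).
rng = np.random.default_rng(0)
neighbors = multi_soft_nearest_neighbors(rng.normal(size=128), rng.normal(size=(40, 128)))
print(neighbors.shape)  # (2, 128)
```

In a cycle-consistency setting, each of the returned neighbors could then be cycled back to the original video and scored, which is one plausible way to realize a multi-cycle-back regression loss.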