Abstract: In this paper, we tackle the problem of self-supervised video alignment and activity progress prediction using in-the-wild videos. Our proposed self-supervised representation learning method carefully addresses different action orderings, redundant actions, and background frames to generate improved video representations compared to previous methods. Our model generalizes temporal cycle-consistency learning to allow for more flexibility in determining cycle-consistent neighbors. More specifically,
to handle repeated actions, we propose a multi-neighbor
cycle consistency and a multi-cycle-back regression loss
by finding multiple soft nearest neighbors using a Gaussian Mixture Model. To handle background and redundant
frames, we introduce a context-dependent drop function in
our framework, discouraging the alignment of droppable
frames. Finally, to learn from videos of multiple activities jointly, we propose a multi-head cross-task network, allowing us to embed a video and estimate progress
without knowing its activity label. Experiments on multiple
datasets show that our method outperforms the state-of-the-art for video alignment and progress prediction.
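As a rough illustration of the multi-neighbor idea, the sketch below computes several soft nearest neighbors for a single query frame by weighting the frames of the other video with both a softmax similarity and the responsibilities of a fitted Gaussian Mixture Model. This is a minimal sketch under assumed details (embedding dimension, number of components `k`, a temperature on the similarities, and the use of scikit-learn's `GaussianMixture`), not the paper's exact formulation or training loss.

```python
# Minimal sketch (illustrative assumptions, not the paper's exact method):
# compute k soft nearest neighbors of one query frame in another video.
import numpy as np
from sklearn.mixture import GaussianMixture

def multi_soft_nearest_neighbors(u_i, V, k=2, temperature=0.1):
    """u_i: (d,) query frame embedding from video U.
    V:   (T, d) frame embeddings of video V.
    Returns a (k, d) array: one soft nearest neighbor per mixture component."""
    # Stabilized softmax over negative squared distances, as in
    # soft-nearest-neighbor-based cycle-consistency losses.
    dists = np.sum((V - u_i) ** 2, axis=1)
    alpha = np.exp(-(dists - dists.min()) / temperature)
    alpha /= alpha.sum()

    # Fit a GMM over V's frame embeddings; the responsibilities split the
    # frames into k groups, so repeated occurrences of the same action can
    # each contribute their own neighbor.
    gmm = GaussianMixture(n_components=k, covariance_type="diag", random_state=0).fit(V)
    resp = gmm.predict_proba(V)  # (T, k)

    # One similarity-weighted average per component -> k soft nearest neighbors.
    w = alpha[:, None] * resp
    w /= w.sum(axis=0, keepdims=True) + 1e-8
    return w.T @ V  # (k, d)

# Example usage with random embeddings (d=128, T=40).
rng = np.random.default_rng(0)
neighbors = multi_soft_nearest_neighbors(rng.normal(size=128), rng.normal(size=(40, 128)))
print(neighbors.shape)  # (2, 128)
```

In a cycle-consistency setting, each of the returned neighbors could then be cycled back to the original video and scored, which is one plausible way to realize a multi-cycle-back regression loss.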