Unsupervised Video Representation Learning by Bidirectional Feature Prediction
Abstract: This paper introduces a novel method for self-supervised video representation learning via feature prediction. In contrast to previous methods that focus on future feature prediction, we argue that a supervisory signal arising from unobserved past frames is complementary to one that originates from future frames. The rationale behind our method is to encourage the network to explore the temporal structure of videos by distinguishing between future and past given present observations. We train our model in a contrastive learning framework, where joint encoding of future and past provides us with a comprehensive set of temporal hard negatives via swapping. We empirically show that utilizing both signals enriches the learned representations for the downstream task of action recognition, and that it outperforms independent prediction of either the future or the past alone.
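The contrastive objective described in the abstract can be sketched roughly as follows. This is a minimal InfoNCE-style illustration, not the paper's implementation: the function name, the idea of separate predictions for future and past from the present clip, and the use of all within-batch embeddings as the candidate pool are assumptions for illustration. The key element from the abstract is that each past embedding acts as a temporal hard negative for the future prediction (and vice versa), i.e. the "swapped" pairing is penalized.

```python
import numpy as np

def bidirectional_nce(pred_future, pred_past, z_future, z_past, tau=0.07):
    """Illustrative bidirectional contrastive loss (names are assumptions).

    pred_future / pred_past: predictions made from the present clip, (B, d).
    z_future / z_past: encoded future and past observations, (B, d).
    For each future prediction, the matching z_future is the positive and
    every z_past (including the same clip's, i.e. the swapped pairing) is a
    temporal hard negative; symmetrically for the past prediction.
    """
    def normalize(x):
        return x / np.linalg.norm(x, axis=1, keepdims=True)

    pf, pp = normalize(pred_future), normalize(pred_past)
    zf, zp = normalize(z_future), normalize(z_past)
    B = pf.shape[0]
    # Candidate pool: all future and past embeddings in the batch, (2B, d).
    cands = np.concatenate([zf, zp], axis=0)

    def nce(queries, pos_idx):
        logits = queries @ cands.T / tau                   # (B, 2B) similarities
        logits -= logits.max(axis=1, keepdims=True)        # numerical stability
        log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(B), pos_idx].mean()    # cross-entropy on positives

    # Positives for future predictions sit in the first half of the pool,
    # positives for past predictions in the second half.
    return 0.5 * (nce(pf, np.arange(B)) + nce(pp, np.arange(B) + B))
```

Because both prediction directions score against the same pool, the swapped future/past pairings are always present as negatives, which is one way to realize the "temporal hard negatives via swapping" the abstract describes.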