Abstract: We consider the task of training a neural network to anticipate
human actions in video. This task is challenging given
the complexity of video data, the stochastic nature of the future,
and the limited amount of annotated training data. In
this paper, we propose a novel knowledge distillation framework
that uses an action recognition network to supervise the
training of an action anticipation network, guiding the latter to
attend to the relevant information needed for correctly anticipating
future actions. The framework is made possible by a novel loss function
that accounts for positional shifts of semantic
concepts in a dynamic video. The knowledge distillation
framework is a form of self-supervised learning, and it
takes advantage of unlabeled data. Experimental results on
the JHMDB and EPIC-KITCHENS datasets show the effectiveness
of our approach.
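
A minimal sketch of the distillation idea described above, assuming a PyTorch setup: a frozen teacher (action recognition) network observes the full clip, while the student (action anticipation) network observes only the beginning, and the student's feature map is trained to match the teacher's. The shift-tolerant loss below (a minimum over small spatial offsets) is a hypothetical stand-in for the paper's actual loss; the names `shift_tolerant_loss`, `teacher`, and `student` are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def shift_tolerant_loss(student_feat, teacher_feat, max_shift=1):
    """L2 feature-matching loss that tolerates small positional shifts.

    For each spatial offset (dy, dx) within +/-max_shift, compare the
    overlapping regions of the two feature maps and keep the smallest
    per-sample distance, so a semantic concept that moved slightly between
    the observed frames and the full clip is not penalized.
    NOTE: this specific loss is an assumption, not the paper's formula.
    """
    B, C, H, W = student_feat.shape
    best = None
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            # Crop both maps to their overlap under offset (dy, dx).
            s = student_feat[:, :, max(0, dy):H + min(0, dy),
                                    max(0, dx):W + min(0, dx)]
            t = teacher_feat[:, :, max(0, -dy):H + min(0, -dy),
                                    max(0, -dx):W + min(0, -dx)]
            # Per-sample mean squared distance over channels and space.
            d = F.mse_loss(s, t, reduction='none').mean(dim=(1, 2, 3))
            best = d if best is None else torch.minimum(best, d)
    return best.mean()

# Usage sketch: distillation on unlabeled video (no action labels needed),
# with hypothetical `teacher` / `student` networks producing (B, C, H, W)
# feature maps.
# teacher = pretrained_recognition_net.eval()   # frozen, sees full clip
# student = anticipation_net.train()            # sees only observed frames
# with torch.no_grad():
#     t_feat = teacher(full_clip)
# s_feat = student(observed_frames)
# loss = shift_tolerant_loss(s_feat, t_feat, max_shift=2)
# loss.backward()
```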