Disentangling Motion, Foreground and Background Features in Videos

Xunyu Lin; Victor Campos; Xavier Giro-i-Nieto; Jordi Torres; Cristian Canton Ferrer

Disentangling Motion, Foreground and Background Features in Videos

Xunyu Lin, Victor Campos, Xavier Giro-i-Nieto, Jordi Torres, Cristian Canton Ferrer

07 Jul 2025 (modified: 16 May 2017)CVPR 2017 BNMW SubmissionReaders: Everyone

Paper Length: 4 page

Abstract: This paper instroduces an unsupervised framework to extract semantically rich features for video representation. Inspired by how the human visual system groups objects based on motion cues, we propose a deep convolutional neural network that disentangles motion, foreground and background information. The proposed architecture consists of a 3D convolutional feature encoder for blocks of 16 frames, which is trained for reconstruction tasks over the first and last frames of the sequence. The model is trained with a fraction of videos from the UCF-101 dataset taking as ground truth the bounding boxes around the activity regions. Qualitative results indicate that the network can successfully update the foreground appearance based on pure-motion features. The benefits of these learned features are shown in a discriminative classification task when compared with a random initialization of the network weights, providing a gain of accuracy above the 10%.

Conflicts: bsc.es, dfki.de, columbia.edu, upc.edu, google.com, vilynx.com, insight-center.org, dcu.ie, buaa.edu.cn, umich.edu, facebook.com

1 Reply

Loading