Unsupervised Learning from Video with Deep Neural Embeddings

Chengxu Zhuang; Tianwei She; Alex Andonian; Daniel Yamins

25 Sept 2019 (modified: 12 Oct 2025) | ICLR 2020 Conference Withdrawn Submission | Readers: Everyone

Keywords: Unsupervised learning, action recognition, video learning, deep neural networks

Abstract: Because of the rich dynamical structure of videos and their ubiquity in everyday life, it is a natural idea that video data could serve as a powerful unsupervised learning signal for visual representations. However, instantiating this idea, especially at large scale, has remained a significant artificial intelligence challenge. Here we present the Video Instance Embedding (VIE) framework, which trains deep nonlinear embeddings on video sequence inputs. By learning embedding dimensions that identify and group similar videos together, while pushing inherently different videos apart in the embedding space, VIE captures the strong statistical structure inherent in videos, without the need for external annotation labels. We find that, when trained on a large-scale video dataset, VIE yields powerful representations both for action recognition and single-frame object categorization, substantially improving on the state of the art wherever direct comparisons are possible. We show that a two-pathway model with both static and dynamic processing pathways is optimal, provide analyses indicating how the model works, and perform ablation studies showing the importance of key architecture and loss function choices. Our results suggest that deep neural embeddings are a promising approach to unsupervised video learning for a wide variety of task domains.
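
The grouping idea in the abstract (pull similar videos together, push different ones apart) can be illustrated with a generic softmax-style embedding loss. The sketch below is an assumption made for illustration, not the loss used in VIE: the function names `l2_normalize` and `grouping_loss`, the `temperature` value, and the way "similar" items are supplied by the caller are all hypothetical, and the embeddings are plain NumPy vectors rather than the output of a video network.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def grouping_loss(embeddings, anchor, similar_ids, temperature=0.1):
    """Illustrative loss for one anchor embedding (hypothetical, not VIE's loss).

    The loss is the negative log of the softmax probability mass that the
    anchor assigns to the items listed in `similar_ids`, measured against
    all other items. It is small when similar items are close to the anchor
    and dissimilar items are far away.
    """
    z = l2_normalize(np.asarray(embeddings, dtype=np.float64))
    logits = z @ z[anchor] / temperature

    # Exclude the anchor itself from the comparison set.
    others = np.ones(len(z), dtype=bool)
    others[anchor] = False

    # Subtract the largest remaining logit for numerical stability.
    logits = logits - logits[others].max()
    weights = np.exp(logits) * others

    similar = np.zeros(len(z), dtype=bool)
    similar[list(similar_ids)] = True
    similar &= others

    return -np.log(weights[similar].sum() / weights.sum())


# Toy example: items 0 and 1 point in one direction, items 2 and 3 in another.
toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
loss_when_neighbor_is_close = grouping_loss(toy, anchor=0, similar_ids=[1])
loss_when_neighbor_is_far = grouping_loss(toy, anchor=0, similar_ids=[2])
```

In this toy setup the loss is lower when the item marked as similar already lies near the anchor, so minimizing it over a trainable embedding would move similar items together and others apart. How similar videos are identified without labels is left open here.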

Community Implementations: [2 code implementations](https://www.catalyzex.com/paper/unsupervised-learning-from-video-with-deep/code)
