Learning Spatiotemporal Features via Video and Text Pair Discrimination

Tianhao Li; Limin Wang

Learning Spatiotemporal Features via Video and Text Pair Discrimination

Tianhao Li, Limin Wang

28 Sept 2020 (modified: 22 Jun 2025)ICLR 2021 Conference Blind SubmissionReaders: Everyone

Keywords: Spatiotemporal Feature Learning, Video and Text Pair Discrimination, Self-/Weakly Supervised Learning

Abstract: Current video representations heavily rely on learning from manually annotated video datasets which are time-consuming and expensive to acquire. We observe videos are naturally accompanied by abundant text information such as YouTube titles and Instagram captions. In this paper, we leverage this visual-textual connection to learn spatiotemporal features in an efficient weakly-supervised manner. We present a general cross-modal pair discrimination (CPD) framework to capture this correlation between a video and its associated text. We train our CPD models on both standard video dataset (Kinetics-210k) and uncurated web video dataset (Instagram-300k) to demonstrate its effectiveness. Without further fine-tuning, the learnt models obtain competitive results for action classification on Kinetics under the linear classification protocol. Moreover, our visual model provides an effective initialization to fine-tune on downstream tasks, which yields a remarkable performance gain for action recognition on UCF101 and HMDB51, compared with the existing state-of-the-art self-supervised training methods. In addition, our CPD demonstrates that pre-training on a relatively small dataset is able to yield a comparable performance to those methods of using order magnitude more data, which is meaningful and practicable for the scenarios with limited computational facilities.

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics

One-sentence Summary: An efficient spatiotemporal feature learning method via video and text discrimination.

Community Implementations: [![CatalyzeX](/images/catalyzex_icon.svg) 1 code implementation](https://www.catalyzex.com/paper/learning-spatiotemporal-features-via-video/code)

Reviewed Version (pdf): https://openreview.net/references/pdf?id=vGBOXsprU3

9 Replies

Loading