Token-Level Contrast for Video and Language Alignment

28 Sept 2020 (modified: 05 May 2023) · ICLR 2021 Conference Withdrawn Submission · Readers: Everyone
Keywords: token-level contrastive loss, video and language alignment, video retrieval, multi-modal representation learning
Abstract: Building video and language understanding models requires grounding linguistic concepts and video content in a shared space. Most previous works learn a holistic alignment between the two modalities while neglecting token-level grounding. Masked token prediction can be used to learn token-level multi-modal representations, but it does not necessarily force lexical grounding on perception, and it also introduces a domain shift between pretraining and fine-tuning. This paper introduces a simple token-level contrastive loss (ToCo) informed by syntactic classes (e.g., nouns and verbs) that forces the model to prioritize grounding concrete, semantics-bearing words. ToCo does not mask inputs; instead, it applies both local (contextual token) and global (lexical type) pressures for multi-modal alignment in a contrastive manner. Our approach enables a simple vanilla BERT-based multi-modal transformer to compete with or outperform existing heavily engineered multi-loss or large models on three benchmarks (YouCook2, MSR-VTT, and CrossTask). Further, ToCo is plug-and-play: it yields gains when applied in either pretraining or downstream tasks alone, regardless of the underlying visual or textual feature representations.
One-sentence Summary: We propose a simple token-level contrastive loss that helps learn better alignment and multi-modal representations for video and language tasks.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Reviewed Version (pdf): https://openreview.net/references/pdf?id=YEuRd5fkv
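
The abstract describes ToCo only at a high level. Below is a minimal, hypothetical sketch of a syntax-informed token-level contrastive loss in the spirit of the abstract, not the authors' released code: it assumes per-token contextual text embeddings, per-clip video embeddings, and a precomputed part-of-speech mask marking nouns and verbs. All names, shapes, and the max-over-clips positive assignment are illustrative assumptions.

```python
# Minimal sketch (NOT the authors' implementation) of a syntax-informed
# token-level contrastive loss. Assumed inputs: contextual text-token
# embeddings, video clip embeddings, and a POS mask selecting the
# concrete, semantics-bearing tokens (e.g., nouns and verbs).
import torch
import torch.nn.functional as F


def token_contrastive_loss(text_tok, video_tok, pos_mask, temperature=0.07):
    """InfoNCE-style token-level contrast between text tokens and video clips.

    text_tok:  (B, T, D) contextual text-token embeddings
    video_tok: (B, V, D) video clip embeddings
    pos_mask:  (B, T) bool, True for noun/verb tokens to be grounded
    """
    text_tok = F.normalize(text_tok, dim=-1)
    video_tok = F.normalize(video_tok, dim=-1)

    B, T, D = text_tok.shape

    # Local pressure: each text token is contrasted against every clip in
    # the batch; clips from other videos act as negatives.
    all_clips = video_tok.reshape(-1, D)                      # (B*V, D)
    logits = text_tok.reshape(-1, D) @ all_clips.t()          # (B*T, B*V)
    logits = logits / temperature

    # Positive score: best-matching clip within the paired video (a soft
    # assignment, since token-to-clip correspondence is unlabeled).
    sim = torch.einsum('btd,bvd->btv', text_tok, video_tok)   # (B, T, V)
    pos = sim.max(dim=-1).values.reshape(-1) / temperature    # (B*T,)

    # -log softmax of the positive against all clips in the batch.
    loss_per_tok = (torch.logsumexp(logits, dim=-1) - pos).reshape(B, T)

    # Global (lexical-type) pressure: only noun/verb tokens contribute,
    # so the model prioritizes grounding semantics-bearing words.
    mask = pos_mask.float()
    return (loss_per_tok * mask).sum() / mask.sum().clamp(min=1.0)
```

Because the loss consumes generic token and clip embeddings, it can be added on top of any encoder pair, which is consistent with the abstract's plug-and-play claim that gains do not depend on the underlying visual or textual feature representations.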