Look at What I’m Doing: Self-Supervised Spatial Grounding of Narrations in Instructional Videos

Reuben Tan; Bryan A. Plummer; Kate Saenko; Hailin Jin; Bryan Russell

Look at What I’m Doing: Self-Supervised Spatial Grounding of Narrations in Instructional Videos

Reuben Tan, Bryan A. Plummer, Kate Saenko, Hailin Jin, Bryan Russell

Published: 09 Nov 2021, Last Modified: 05 May 2023NeurIPS 2021 SpotlightReaders: Everyone

Keywords: Self-supervision, video understanding, natural language

TL;DR: Self-supervised interaction grounding in videos

Abstract: We introduce the task of spatially localizing narrated interactions in videos. Key to our approach is the ability to learn to spatially localize interactions with self-supervision on a large corpus of videos with accompanying transcribed narrations. To achieve this goal, we propose a multilayer cross-modal attention network that enables effective optimization of a contrastive loss during training. We introduce a divided strategy that alternates between computing inter- and intra-modal attention across the visual and natural language modalities, which allows effective training via directly contrasting the two modalities' representations. We demonstrate the effectiveness of our approach by self-training on the HowTo100M instructional video dataset and evaluating on a newly collected dataset of localized described interactions in the YouCook2 dataset. We show that our approach outperforms alternative baselines, including shallow co-attention and full cross-modal attention. We also apply our approach to grounding phrases in images with weak supervision on Flickr30K and show that stacking multiple attention layers is effective and, when combined with a word-to-region loss, achieves state of the art on recall-at-one and pointing hand accuracies.

Supplementary Material: pdf

Code Of Conduct: I certify that all co-authors of this work have read and commit to adhering to the NeurIPS Statement on Ethics, Fairness, Inclusivity, and Code of Conduct.

Code: https://cs-people.bu.edu/rxtan/projects/grounding_narrations

14 Replies

Loading