Retrieval-Augmented Egocentric Video Captioning

Published: 2024, Last Modified: 21 Jan 2026, Venue: CVPR 2024, License: CC BY-SA 4.0
Abstract: Understanding human actions from videos of first-person view poses significant challenges. Most prior approaches explore representation learning on egocentric videos only, while overlooking the potential benefit of exploiting existing large-scale third-person videos. In this paper, (1) we develop EgoInstructor, a retrieval-augmented multimodal captioning model that automatically retrieves semantically relevant third-person instructional videos to enhance the video captioning of egocentric videos, (2) for training the cross-view retrieval module, we devise an automatic pipeline to discover ego-exo video pairs from distinct large-scale egocentric and exocentric datasets, (3) we train the cross-view retrieval module with a novel EgoExoNCE loss that pulls egocentric and exocentric video features closer, by aligning them to shared text features that describe similar actions, (4) through extensive experiments, our cross-view retrieval module demonstrates superior performance across seven benchmarks. Regarding egocentric video captioning, EgoInstructor exhibits significant improvements by leveraging third-person videos as references.
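
The abstract does not give the exact form of the EgoExoNCE loss. As a rough illustration of the general idea in point (3), aligning both egocentric and exocentric video features to shared text features, the sketch below uses a generic multi-positive InfoNCE-style objective. It is not the paper's formulation: the function name `multi_positive_nce`, the toy feature vectors, the positive sets, and the temperature of 0.07 are all assumptions made for illustration.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def multi_positive_nce(video_feats, text_feats, positives, temperature=0.07):
    """Hypothetical contrastive loss (an assumption, not the paper's EgoExoNCE).

    For each video i, computes -log(sum_{j in positives[i]} exp(s_ij / t)
    / sum_j exp(s_ij / t)) and averages over videos. positives[i] is the set
    of text indices that describe an action similar to video i.
    """
    videos = [l2_normalize(v) for v in video_feats]
    texts = [l2_normalize(t) for t in text_feats]
    total = 0.0
    for i, v in enumerate(videos):
        logits = [dot(v, t) / temperature for t in texts]
        m = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(z - m) for z in logits]
        pos = sum(exps[j] for j in positives[i])
        total += -math.log(pos / sum(exps))
    return total / len(videos)


# Toy data (assumed): one egocentric and two exocentric clips, three narrations.
ego_videos = [[1.0, 0.1]]
exo_videos = [[0.9, 0.2], [0.0, 1.0]]
texts = [[1.0, 0.0], [1.0, 0.05], [0.0, 1.0]]

# Texts 0 and 1 describe the same action, so both are positives for the
# ego clip and the first exo clip; text 2 matches the second exo clip.
aligned = multi_positive_nce(ego_videos + exo_videos, texts, [{0, 1}, {0, 1}, {2}])

# Deliberately wrong pairing, for comparison.
mismatched = multi_positive_nce(ego_videos + exo_videos, texts, [{2}, {2}, {0, 1}])
```

Because the egocentric and exocentric clips share the same positive text set, minimizing a loss of this kind pushes both views toward the same text features, and therefore toward each other. In this toy setup the correctly paired loss is lower than the mismatched one.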