EgoEnv: Human-centric environment representations from egocentric video

Published: 21 Sept 2023, Last Modified: 02 Nov 2023, NeurIPS 2023 oral
Keywords: egocentric video, 3D environment, sim2real, sim-to-real, episodic memory
TL;DR: We learn "environment-aware" ego-video representations that encode not just a short 1-2s clip, but also the local surroundings of the camera-wearer (e.g., what objects are nearby and how far away they are).
Abstract: First-person video highlights a camera-wearer's activities in the context of their persistent environment. However, current video understanding approaches reason over visual features from short video clips that are detached from the underlying physical space and capture only what is immediately visible. To facilitate human-centric environment understanding, we present an approach that links egocentric video and the environment by learning representations that are predictive of the camera-wearer's (potentially unseen) local surroundings. We train such models using videos from agents in simulated 3D environments where the environment is fully observable, and test them on human-captured real-world videos from unseen environments. On two human-centric video tasks, we show that models equipped with our environment-aware features consistently outperform their counterparts with traditional clip features. Moreover, despite being trained exclusively on simulated videos, our approach successfully handles real-world videos from HouseTours and Ego4D, and achieves state-of-the-art results on the Ego4D NLQ challenge.
Supplementary Material: pdf
