Exocentric-to-Egocentric Video Generation

Published: 25 Sept 2024, Last Modified: 06 Nov 2024
Venue: NeurIPS 2024 poster
License: CC BY 4.0
Keywords: Exocentric-Egocentric Vision, Video Generation, Viewpoint Translation
TL;DR: A novel exocentric-to-egocentric video generation method for challenging daily-life skilled human activities.
Abstract: We introduce Exo2Ego-V, a novel exocentric-to-egocentric diffusion-based video generation method for daily-life skilled human activities, where four sparse exocentric viewpoints are configured 360° around the scene. This task is particularly challenging due to the significant variation between exocentric and egocentric viewpoints and the high complexity of dynamic motions and real-world daily-life environments. To address these challenges, we first propose a new diffusion-based multi-view exocentric encoder that extracts dense multi-scale features from the multi-view exocentric videos as appearance conditions for egocentric video generation. We then design an exocentric-to-egocentric view translation prior that provides spatially aligned egocentric features as concatenation guidance for the input of the egocentric video diffusion model. Finally, we introduce temporal attention layers into our egocentric video diffusion pipeline to improve temporal consistency across egocentric frames. Extensive experiments demonstrate that Exo2Ego-V significantly outperforms state-of-the-art approaches on 5 categories from the Ego-Exo4D dataset, improving LPIPS by an average of 35%. Our code and model will be made available at https://github.com/showlab/Exo2Ego-V.
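To make the temporal-attention component of the pipeline concrete, below is a minimal sketch (not the authors' released code) of a temporal attention layer of the kind the abstract describes: egocentric frame features are reshaped so each spatial location attends across frames. The tensor layout, hidden size, and head count are illustrative assumptions.

```python
# Hypothetical sketch of a temporal attention layer for cross-frame consistency.
import torch
import torch.nn as nn


class TemporalAttention(nn.Module):
    def __init__(self, channels: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, channels, height, width) feature maps from the video UNet.
        b, f, c, h, w = x.shape
        # Fold spatial positions into the batch so attention runs over frames only.
        tokens = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, f, c)
        normed = self.norm(tokens)
        attn_out, _ = self.attn(normed, normed, normed)
        # Residual connection, then restore the original layout.
        out = (tokens + attn_out).reshape(b, h, w, f, c).permute(0, 3, 4, 1, 2)
        return out


# Usage: features for 8 egocentric frames at a coarse resolution.
frames = torch.randn(1, 8, 64, 32, 32)
print(TemporalAttention(64)(frames).shape)  # torch.Size([1, 8, 64, 32, 32])
```

In video diffusion models, such layers are typically interleaved with the existing spatial blocks so that spatial attention handles per-frame appearance while temporal attention enforces consistency across frames.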
Supplementary Material: zip
Primary Area: Generative models
Submission Number: 10425