4DIFF: 3D-Aware Diffusion Model for Third-to-First Viewpoint Translation

Feng Cheng; Mi Luo; Huiyu Wang; Alex Dimakis; Lorenzo Torresani; Gedas Bertasius; Kristen Grauman

4DIFF: 3D-Aware Diffusion Model for Third-to-First Viewpoint Translation

Feng Cheng, Mi Luo, Huiyu Wang, Alex Dimakis, Lorenzo Torresani, Gedas Bertasius, Kristen Grauman

Published: 01 Jan 2024, Last Modified: 19 May 2025ECCV (24) 2024EveryoneRevisionsBibTeXCC BY-SA 4.0

Abstract: We present 4Diff, a 3D-aware diffusion model addressing the exo-to-ego viewpoint translation task—generating first-person (egocentric) view images from the corresponding third-person (exocentric) images. Building on the diffusion model’s ability to generate photorealistic images, we propose a transformer-based diffusion model that incorporates geometry priors through two mechanisms: (i) egocentric point cloud rasterization and (ii) 3D-aware rotary cross-attention. Egocentric point cloud rasterization converts the input exocentric image into an egocentric layout, which is subsequently used by a diffusion image transformer. As a component of the diffusion transformer’s denoiser block, the 3D-aware rotary cross-attention further incorporates 3D information and semantic features from the source exocentric view. Our 4Diff achieves state-of-the-art results on the challenging and diverse Ego-Exo4D multiview dataset and exhibits robust generalization to novel environments not encountered during training. Our code, processed data, and pretrained models are publicly available at https://klauscc.github.io/4diff.

Loading