EgoAnimate: Generating Human Animations from Egocentric Top-Down Views via Controllable Latent Diffusion Models

Published: 24 Sept 2025, Last Modified: 07 Nov 2025
Venue: NeurIPS 2025 Workshop GenProCC
License: CC BY 4.0
Track: Regular paper
Keywords: egocentric vision, egocentric avatars, latent diffusion models, novel view synthesis, controllable generative models, digital avatars, VR telepresence
TL;DR: EgoAnimate is a controllable diffusion pipeline that converts a single egocentric top-down view into an animatable avatar, enabling accessible VR telepresence while addressing bias, fairness, and responsible generative use.
Abstract: An ideal digital telepresence experience requires the accurate replication of a person’s body, clothing, and movements. In order to capture and transfer these movements into virtual reality, the egocentric (first-person) perspective can be adopted, which makes it feasible to rely on a portable and cost-effective standalone device that requires no additional front-view cameras. However, this perspective also introduces considerable challenges, particularly in learning tasks, as egocentric data often contains severe occlusions and distorted body proportions. Human appearance and avatar reconstruction from egocentric views remains relatively underexplored, and approaches that leverage generative priors are rare. This gap contributes to limited out-of-distribution generalization and greater data and training requirements. We introduce a controllable latent-diffusion framework that maps egocentric inputs to a canonical exocentric (frontal T-pose) representation from which animatable avatars are reconstructed. To our knowledge, this is the first system to employ a generative diffusion backbone for egocentric avatar reconstruction. Building on a Stable Diffusion prior with explicit pose/shape conditioning, our method reduces the training and data burden and improves generalization to in-the-wild inputs. The idea of synthesizing fully occluded parts of an object has been widely explored in various domains; in particular, models such as SiTH and MagicMan have demonstrated successful 360-degree reconstruction from a single frontal image. Inspired by these approaches, we propose a pipeline that reconstructs a frontal view from a highly occluded top-down image using ControlNet and a Stable Diffusion backbone, enabling the synthesis of novel views. Our objective is to map a single egocentric top-down image to a canonical frontal (e.g., T-pose) representation that can be directly consumed by an image-to-motion model to produce an animatable avatar. This enables motion synthesis from minimal egocentric input and supports more accessible, data-efficient, and generalizable telepresence systems.
Submission Number: 39
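Illustrative sketch (not the authors' released code): the abstract describes conditioning a Stable Diffusion backbone with a ControlNet on pose/shape signals to synthesize a canonical frontal view. The snippet below shows how such conditioning could be wired up with the Hugging Face diffusers library; the model identifiers, the OpenPose-style ControlNet, and the file names are assumptions for illustration, and the paper's actual pipeline additionally conditions on the egocentric top-down image rather than on a text prompt alone.

# Minimal sketch, assuming the diffusers library and publicly available
# Stable Diffusion / ControlNet checkpoints; this is not the EgoAnimate code.
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Pose/shape conditioning image, e.g. a rendered frontal T-pose skeleton
# (hypothetical file name).
pose_map = Image.open("frontal_tpose_condition.png").convert("RGB")

# Load an OpenPose-style ControlNet and attach it to a Stable Diffusion backbone.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# In the paper's setting, appearance information from the egocentric top-down
# view would also be injected as conditioning; here a text prompt stands in
# for that signal purely for illustration.
frontal = pipe(
    prompt="full-body frontal view of a person in a T-pose, studio lighting",
    image=pose_map,
    num_inference_steps=30,
    guidance_scale=7.5,
).images[0]
frontal.save("canonical_frontal_view.png")

The generated canonical frontal image could then be handed to an image-to-motion or avatar-reconstruction model, mirroring the downstream step described in the abstract.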