EgoAnimate: Generating Human Animations from Egocentric Top-Down Views via Controllable Latent Diffusion Models
Track: Regular paper
Keywords: egocentric vision, egocentric avatars, latent diffusion models, novel view synthesis, controllable generative models, digital avatars, VR telepresence
TL;DR: EgoAnimate is a controllable diffusion pipeline that converts a single egocentric top-down view into an animatable avatar, enabling accessible VR telepresence while addressing bias, fairness, and responsible generative use.
Abstract: An ideal digital telepresence experience requires the accurate replication of a person’s body, clothing, and movements. To capture and transfer these movements into virtual reality, the egocentric (first-person) perspective can be adopted, which makes it feasible to rely on a portable and cost-effective standalone device that requires no additional front-view cameras. However, this perspective also introduces considerable challenges, particularly for learning tasks, as egocentric data often contains severe occlusions and distorted body proportions. Human appearance and avatar reconstruction from egocentric views remains relatively underexplored, and approaches that leverage generative priors are rare. This gap contributes to limited out-of-distribution generalization and greater data and training requirements. We introduce a controllable latent-diffusion framework that maps egocentric inputs to a canonical exocentric (frontal T-pose) representation from which animatable avatars are reconstructed. To our knowledge, this is the first system to employ a generative diffusion backbone for egocentric avatar reconstruction. Building on a Stable Diffusion prior with explicit pose/shape conditioning, our method reduces the training and data burden and improves generalization to in-the-wild inputs. The idea of synthesizing fully occluded parts of an object has been widely explored across domains; in particular, models such as SiTH and MagicMan have demonstrated successful 360-degree reconstruction from a single frontal image. Inspired by these approaches, we propose a pipeline that reconstructs a frontal view from a highly occluded top-down image using ControlNet and a Stable Diffusion backbone, enabling the synthesis of novel views. Our objective is to map a single egocentric top-down image to a canonical frontal (e.g., T-pose) representation that can be directly consumed by an image-to-motion model to produce an animatable avatar. This enables motion synthesis from minimal egocentric input and supports more accessible, data-efficient, and generalizable telepresence systems.
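For concreteness, below is a minimal sketch (not the authors' released code) of what a pose-conditioned Stable Diffusion + ControlNet generation step can look like in Hugging Face diffusers. The checkpoints, conditioning map, prompt, and file paths are illustrative assumptions, and the paper's mechanism for injecting appearance cues from the egocentric top-down image is not reproduced here.

```python
# Sketch only: pose-conditioned frontal-view synthesis with ControlNet +
# Stable Diffusion in Hugging Face `diffusers`. Model names, the prompt,
# and the conditioning image are placeholder assumptions, not the paper's
# actual configuration.
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

# Pose-conditioning branch (e.g., an OpenPose-style skeleton rendered in the
# target canonical frontal T-pose).
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16
)

# Stable Diffusion backbone providing the generative prior.
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# Conditioning image: a rendered T-pose skeleton/shape map (hypothetical path).
pose_map = load_image("tpose_skeleton.png")

# In the described system, appearance would additionally be conditioned on the
# egocentric top-down input; here a text prompt stands in for that signal.
frontal = pipe(
    prompt="full-body frontal photo of a person standing in a T-pose",
    image=pose_map,
    num_inference_steps=30,
    guidance_scale=7.5,
).images[0]
frontal.save("frontal_tpose.png")
```

The resulting canonical frontal image could then be handed to an image-to-motion/animation model, mirroring the two-stage pipeline the abstract describes.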
Submission Number: 39