Abstract: This paper presents a method to infer a 3D face avatar model from a single arbitrarily posed image, using the 3D Gaussian Splatting (3DGS) framework. Inferring a full 3DGS face model from one image is a highly ill-posed problem, requiring the estimation of hundreds of thousands, and often well over a million, per-Gaussian appearance and structural parameters. To address this challenge, we draw inspiration from the classical morphable face model literature, in which individual identities are well described as compact deformations (residuals) with respect to a canonical template face model, thereby easing the learning task. We propose leveraging such a template-plus-residuals strategy, but in the unstructured 3DGS parameter space. Rather than predicting absolute 3DGS parameters from scratch given an input face image, our proposed algorithm, FastAvatar, learns to map a face image to residual parameter values with respect to a canonical 3DGS template learned from prior multi-view face data. We couple the feed-forward prediction with a rapid inference-time latent refinement to maximize appearance fidelity to the observed image. Our evaluations on the NeRSemble benchmark demonstrate that FastAvatar can generate 3DGS face models ($\sim$600K parameters) in approximately 3 seconds, achieving state-of-the-art reconstruction accuracy (24.01 dB PSNR and 0.91 SSIM) relative to existing feed-forward, optimization-based, and diffusion-based baselines. Our work demonstrates that residual learning offers a tractable and high-fidelity approach to image synthesis in the popular 3DGS framework.
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Zhixiang_Wang1
Submission Number: 8202