Abstract: Dynamic Neural Radiance Fields (NeRF) have achieved high-fidelity 3D modeling of talking portraits, but slow training and inference speeds have hindered their practical use. This paper proposes an efficient NeRF-based framework that, building on the recent success of grid-based NeRF, enables faster convergence and real-time synthesis of stable talking portraits. This is accomplished by decomposing the inherently high-dimensional talking portrait representation into three low-dimensional feature grids. Specifically, a Decomposed Audio-Spatial Encoding Module models the dynamic head with a 3D spatial grid and a 2D audio grid, where audio dynamics are modeled in a spatially dependent manner to avoid undesirable flickering. The torso is handled with another 2D grid in a lightweight Pseudo-3D Deformable Module. Extensive experiments demonstrate that our method generates realistic, audio-lip synchronized talking portrait videos while remaining highly efficient. Our project page is available at https://me.kiui.moe/radnerf/.
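As a rough illustration of the decomposition described in the abstract, the sketch below queries a low-resolution 3D spatial grid and a separate 2D audio grid, then concatenates the resulting low-dimensional features. This is not the authors' implementation: the class name, the use of dense grids (rather than hash grids), the grid resolutions, and the assumption that audio has already been compressed to a 2D coordinate are all illustrative assumptions.

```python
# Minimal sketch (assumptions, not the paper's code) of a decomposed
# audio-spatial encoding: 3D point coordinates index a spatial feature grid,
# while a 2D audio coordinate indexes a separate audio feature grid.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecomposedAudioSpatialEncoder(nn.Module):
    def __init__(self, spatial_res=64, audio_res=32, feat_dim=8):
        super().__init__()
        # Dense 3D spatial feature grid: (1, C, D, H, W).
        self.spatial_grid = nn.Parameter(
            torch.randn(1, feat_dim, spatial_res, spatial_res, spatial_res) * 0.01)
        # Dense 2D audio feature grid: (1, C, H, W).
        self.audio_grid = nn.Parameter(
            torch.randn(1, feat_dim, audio_res, audio_res) * 0.01)

    def forward(self, xyz, audio_coord):
        """xyz: (N, 3) points in [-1, 1]; audio_coord: (N, 2) in [-1, 1]."""
        n = xyz.shape[0]
        # Trilinear lookup in the 3D spatial grid.
        grid_pts = xyz.view(1, n, 1, 1, 3)
        spatial_feat = F.grid_sample(
            self.spatial_grid, grid_pts, align_corners=True)  # (1, C, N, 1, 1)
        spatial_feat = spatial_feat.view(-1, n).t()           # (N, C)
        # Bilinear lookup in the 2D audio grid.
        audio_pts = audio_coord.view(1, n, 1, 2)
        audio_feat = F.grid_sample(
            self.audio_grid, audio_pts, align_corners=True)   # (1, C, N, 1)
        audio_feat = audio_feat.view(-1, n).t()               # (N, C)
        # The concatenated low-dimensional features would feed a small MLP head
        # predicting density and color.
        return torch.cat([spatial_feat, audio_feat], dim=-1)
```

The point of the decomposition is that each lookup is cheap and low-dimensional, which is what allows grid-based NeRF training and inference to remain fast even with an additional audio condition.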
External IDs: dblp:journals/ijcv/TangWZCHHLLZW25