Abstract: Dynamic Neural Radiance Fields (NeRF) have achieved high-fidelity 3D modeling of talking portraits, but slow training and inference speeds have hindered their practical use. This paper proposes an efficient NeRF-based framework that, building on the recent success of grid-based NeRF, enables faster convergence and real-time synthesis of stable talking portraits. This is accomplished by decomposing the inherently high-dimensional talking portrait representation into three low-dimensional feature grids. Specifically, a Decomposed Audio-Spatial Encoding Module models the dynamic head with a 3D spatial grid and a 2D audio grid, where audio dynamics are modeled in a spatially dependent manner to avoid undesirable flickering. The torso is handled with another 2D grid in a lightweight Pseudo-3D Deformable Module. Extensive experiments demonstrate that our method generates realistic, audio-lip synchronized talking portrait videos while remaining highly efficient. Our project page is available at https://me.kiui.moe/radnerf/.
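As a rough illustration of the decomposition described in the abstract, the sketch below queries a low-resolution 3D spatial grid and a separate 2D audio grid, then concatenates the resulting low-dimensional features. This is not the authors' implementation: the class name, the use of dense grids (rather than hash grids), the grid resolutions, and the assumption that audio has already been compressed to a 2D coordinate are all illustrative assumptions.

```python
# Minimal sketch (assumptions, not the paper's code) of a decomposed
# audio-spatial encoding: 3D point coordinates index a spatial feature grid,
# while a 2D audio coordinate indexes a separate audio feature grid.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecomposedAudioSpatialEncoder(nn.Module):
    def __init__(self, spatial_res=64, audio_res=32, feat_dim=8):
        super().__init__()
        # Dense 3D spatial feature grid: (1, C, D, H, W).
        self.spatial_grid = nn.Parameter(
            torch.randn(1, feat_dim, spatial_res, spatial_res, spatial_res) * 0.01)
        # Dense 2D audio feature grid: (1, C, H, W).
        self.audio_grid = nn.Parameter(
            torch.randn(1, feat_dim, audio_res, audio_res) * 0.01)

    def forward(self, xyz, audio_coord):
        """xyz: (N, 3) points in [-1, 1]; audio_coord: (N, 2) in [-1, 1]."""
        n = xyz.shape[0]
        # Trilinear lookup in the 3D spatial grid.
        grid_pts = xyz.view(1, n, 1, 1, 3)
        spatial_feat = F.grid_sample(
            self.spatial_grid, grid_pts, align_corners=True)  # (1, C, N, 1, 1)
        spatial_feat = spatial_feat.view(-1, n).t()           # (N, C)
        # Bilinear lookup in the 2D audio grid.
        audio_pts = audio_coord.view(1, n, 1, 2)
        audio_feat = F.grid_sample(
            self.audio_grid, audio_pts, align_corners=True)   # (1, C, N, 1)
        audio_feat = audio_feat.view(-1, n).t()               # (N, C)
        # The concatenated low-dimensional features would feed a small MLP head
        # predicting density and color.
        return torch.cat([spatial_feat, audio_feat], dim=-1)
```

The point of the decomposition is that each lookup is cheap and low-dimensional, which is what allows grid-based NeRF training and inference to remain fast even with an additional audio condition.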
External IDs: dblp:journals/ijcv/TangWZCHHLLZW25