ESCA: Enabling Seamless Codec Avatar Execution
through Algorithm and Hardware Co-Optimization for Virtual Reality

Supplementary Material for NeurIPS Submission 19446.

Abstract

Photorealistic Codec Avatars (PCA), which enable high-fidelity human face rendering, are increasingly adopted in AR/VR applications to support immersive communication and interaction via deep learning-based generative models. However, these models impose significant computational demands, making real-time inference challenging on resource-constrained AR/VR devices such as head-mounted displays (HMDs), where latency and power efficiency are critical. To address this challenge, we propose an efficient post-training quantization (PTQ) method tailored for Codec Avatar models, enabling low-precision execution without compromising output quality. In addition, we design a custom hardware accelerator that can be integrated into the system-on-chip (SoC) of AR/VR devices to further enhance processing efficiency. Building on these components, we introduce ESCA, a full-stack optimization framework that accelerates PCA inference on edge AR/VR platforms. Experimental results demonstrate that ESCA boosts FovVideoVDP quality scores by up to +0.39 over the best 4-bit baseline, delivers up to 3.36× latency reduction, and sustains a rendering rate of 100 frames per second in end-to-end tests, satisfying real-time VR requirements. These results demonstrate the feasibility of deploying high-fidelity codec avatars on resource-constrained devices, opening the door to more immersive and portable VR experiences.
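To make the post-training quantization setting concrete, the sketch below shows a generic symmetric round-to-nearest INT4 weight quantizer with per-channel scales. This is a baseline illustration only, not ESCA's calibration procedure (the ICAS/FFAS/UV-aware variants evaluated below); function names and the per-channel layout are illustrative assumptions.

```python
import numpy as np

def quantize_int4_symmetric(w, channel_axis=0):
    """Generic symmetric round-to-nearest INT4 PTQ baseline (illustrative;
    not ESCA's actual calibration). One scale per output channel."""
    qmax = 7  # symmetric signed INT4 grid: integers in [-7, 7]
    reduce_axes = tuple(i for i in range(w.ndim) if i != channel_axis)
    max_abs = np.max(np.abs(w), axis=reduce_axes, keepdims=True)
    scale = np.where(max_abs > 0, max_abs / qmax, 1.0)
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

# Example: quantize a small weight matrix and bound the reconstruction error.
rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=(4, 8)).astype(np.float32)
q, s = quantize_int4_symmetric(w)
max_err = float(np.max(np.abs(dequantize(q, s) - w)))
```

With round-to-nearest, the per-element error is bounded by half the channel scale, which is the error floor that calibration-based methods (AdaRound, GPTQ, and the proposed variants) then improve upon.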

Comparison of Avatars with Full and Quantized Models

Left: Avatar rendered with the full-precision MultiFace model (CVPR 2023-3DMV).
Middle: Degraded avatar with noise and jitter artifacts produced by state-of-the-art post-training quantization (INT4).
Right: Clean, stable avatar produced by the proposed ESCA quantization method (INT4).

Quantization Pipeline

(Figure: overview of the proposed quantization pipeline)

Main Contributions

FovVideoVDP quality scores (higher is better) for the front, left, and right views

Method           Precision   Front    Left     Right
Full Model       FP32        6.5364   5.9480   5.8625
AdaRound+LSQ     W4A4        4.2531   3.6143   3.5606
POCA             W4A4        5.2310   4.3838   4.3457
2DQuant          W4A4        5.2987   4.3948   4.3712
GPTQ             W4A4        5.4980   4.5868   4.5729
ICAS (Ours)      W4A4        5.5901   4.7317   4.7536
UV-W (Ours)      W4A4        5.7559   4.8130   4.8187
ICAS-UV (Ours)   W4A4        5.6438   4.9145   4.9057
FFAS-UV (Ours)   W4A4        5.8541   4.9795   4.9605
AdaRound+LSQ     W8A8        6.2106   5.5004   5.4381
POCA             W8A8        6.4827   5.8511   5.7565
2DQuant          W8A8        6.4983   5.8313   5.7497
GPTQ             W8A8        6.2359   5.6188   5.3613
ICAS (Ours)      W8A8        5.6007   5.3913   5.0762
UV-W (Ours)      W8A8        6.5271   5.9101   5.7610
ICAS-UV (Ours)   W8A8        6.3690   5.6615   5.5998
FFAS-UV (Ours)   W8A8        6.5241   5.8589   5.8071
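The abstract's headline gain of up to +0.39 FovVideoVDP can be reproduced from the W4A4 rows above; a minimal check, with the per-view scores of FFAS-UV and the strongest 4-bit baseline (GPTQ) transcribed from the table:

```python
# Per-view FovVideoVDP scores transcribed from the W4A4 rows of the table.
best_4bit_baseline = {"Front": 5.4980, "Left": 4.5868, "Right": 4.5729}  # GPTQ
ffas_uv_w4a4 = {"Front": 5.8541, "Left": 4.9795, "Right": 4.9605}

# Per-view improvement of FFAS-UV over the best 4-bit baseline.
delta = {view: ffas_uv_w4a4[view] - best_4bit_baseline[view]
         for view in ffas_uv_w4a4}
# The largest per-view gain (left view, 0.3927) rounds to +0.39.
```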

Inference Latency

Model            Device                        Latency (ms)
Encoder (full)   Snapdragon XR2 Gen 2          13.80
Encoder (full)   NVIDIA Jetson Orin NX 16GB     9.96
Encoder (8 bit)  Snapdragon XR2 Gen 2           4.00
Encoder (8 bit)  Our hardware accelerator       3.05
Decoder (full)   NVIDIA Jetson Orin NX 16GB    50.35
Decoder (full)   Snapdragon XR2 Gen 2          25.80
Decoder (8 bit)  Snapdragon XR2 Gen 2          14.50
Decoder (8 bit)  Our hardware accelerator      12.51
Decoder (4 bit)  Our hardware accelerator       3.13
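As a sanity check on the 100 FPS end-to-end claim, the accelerator rows of the table can be combined into a per-frame budget. This is a back-of-the-envelope sketch: it assumes the encoder and decoder run sequentially on the proposed accelerator and ignores rendering, transport, and any other pipeline stages.

```python
# Latencies taken from the table above (our hardware accelerator).
encoder_int8_ms = 3.05  # Encoder (8 bit)
decoder_int4_ms = 3.13  # Decoder (4 bit)

# At 100 FPS the per-frame budget is 10 ms; the sequential
# encoder + decoder time fits well within it.
frame_ms = encoder_int8_ms + decoder_int4_ms
fps = 1000.0 / frame_ms
```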