<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Supplementary Material</title>
    <link rel="stylesheet" href="css/style.css">
</head>
<body>
    <h1>ESCA: Enabling Seamless Codec Avatar Execution<br>through Algorithm and Hardware Co-Optimization for Virtual Reality</h1>
    <p>Supplementary Material for NeurIPS Submission 19446.</p>

    <h2>Abstract</h2>
    <p class="abstract">Photorealistic Codec Avatars (PCA), which enable high-fidelity human face rendering, are increasingly adopted in AR/VR applications to support immersive communication and interaction via deep learning-based generative models. However, these models impose significant computational demands, making real-time inference challenging on resource-constrained AR/VR devices such as head-mounted displays (HMDs), where latency and power efficiency are critical. To address this challenge, we propose an efficient post-training quantization (PTQ) method tailored for Codec Avatar models, enabling low-precision execution without compromising output quality. In addition, we design a custom hardware accelerator that can be integrated into the system-on-chip (SoC) of AR/VR devices to further enhance processing efficiency. Building on these components, we introduce <i>ESCA</i>, a full-stack optimization framework that accelerates PCA inference on edge AR/VR platforms. Experimental results demonstrate that ESCA boosts FovVideoVDP quality scores by up to +0.39 over the best 4-bit baseline, delivers up to 3.36× latency reduction, and sustains a rendering rate of 100 frames per second in end-to-end tests, satisfying real-time VR requirements. These results demonstrate the feasibility of deploying high-fidelity codec avatars on resource-constrained devices, opening the door to more immersive and portable VR experiences.</p>

    <h2>Comparison of Avatars with Full and Quantized Models</h2>
    <video width="1000" controls autoplay muted>
        <source src="assets/esca_w4a4.mp4" type="video/mp4">
        Your browser does not support the video tag.
    </video>
    <p class="video-caption">
        Left: Avatar rendered using the full-precision MultiFace model (CVPR 2023-3DMV)<br>
        Middle: Degraded avatar showing noise and jitter artifacts from state-of-the-art post-training quantization (INT4)<br>
        Right: Clean, stable avatar achieved through the proposed ESCA quantization method (INT4)
    </p>

    <h2>Quantization Pipeline</h2>
    <img src="assets/pipeline.png" alt="Overview of the ESCA quantization pipeline" width="1000">

    <h2>Main Contributions</h2>
    <ul class="contributions">
        <li>
            <strong>Input Channel-wise Activation Smoothing (ICAS):</strong> We introduce a novel input channel-wise smoothing module inserted during training to alleviate extreme inter-channel activation disparities in the VAE decoder. By reducing outlier activations, ICAS diminishes quantization error and prevents aberrations when the model is later quantized to low bit-widths.
        </li>
        <li>
            <strong>Facial-Feature-Aware Smoothing (FFAS):</strong> We develop a region-aware smoothing strategy that uses facial masks to identify key areas like the eyes and mouth. Based on the activation variance in these regions, FFAS selectively skips smoothing for the channels most critical to fine details, preserving important textures while still smoothing less sensitive regions.
        </li>
        <li>
            <strong>UV-weighted Hessian-Based Weight Quantization:</strong> We propose a weight quantization scheme guided by a UV-weighted Hessian matrix of the decoder's loss. This method computes second-order sensitivity and weights it by the UV importance of each face region, thereby prioritizing the precision of weights that most affect critical facial features. This results in a low-bit model that maintains high reconstruction fidelity in salient areas of the face.
        </li>
        <li>
            <strong>Customized DNN Hardware Accelerator:</strong> We co-design a specialized hardware accelerator to support our quantized Codec Avatar model with high-throughput 4-bit and 8-bit operations. The accelerator features an input-combining mechanism to exploit the structured sparsity of the activation matrix. Moreover, an optimized end-to-end pipeline delivers over 100 FPS inference on an AR/VR headset, ensuring smooth, real-time avatar rendering.
        </li>
    </ul>
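    <p>The channel-wise smoothing idea behind ICAS and FFAS can be illustrated with a small sketch. This is <em>not</em> the ESCA implementation: it uses SmoothQuant-style per-input-channel scaling as a stand-in, and the names <code>smoothing_scales</code>, <code>apply_smoothing</code>, <code>alpha</code>, and <code>skip</code> are hypothetical. The <code>skip</code> argument only mimics the FFAS idea of leaving selected channels unsmoothed; how ESCA actually selects those channels from facial masks is not reproduced here.</p>

```python
# Sketch only: SmoothQuant-style per-input-channel smoothing for a linear
# layer Y = X @ W, used as a stand-in for ICAS (an assumption, not the
# authors' code). Matrices are plain row-major lists of lists.

def channel_absmax(x):
    """Largest absolute activation in each input channel (column) of x."""
    return [max(abs(row[j]) for row in x) for j in range(len(x[0]))]

def smoothing_scales(x, w, alpha=0.5, skip=()):
    """Scale for channel j: max|X[:, j]|**alpha / max|W[j, :]|**(1 - alpha).

    Channels listed in `skip` keep a scale of 1.0, i.e. they are left
    unsmoothed (hypothetical stand-in for FFAS channel skipping).
    """
    act_max = channel_absmax(x)
    w_max = [max(abs(v) for v in row) for row in w]
    scales = []
    for j, (a, b) in enumerate(zip(act_max, w_max)):
        if j in skip or a == 0 or b == 0:
            scales.append(1.0)
        else:
            scales.append(a ** alpha / b ** (1 - alpha))
    return scales

def apply_smoothing(x, w, scales):
    """Divide activations and multiply weight rows by the same scales."""
    x_s = [[v / s for v, s in zip(row, scales)] for row in x]
    w_s = [[v * s for v in row] for row, s in zip(w, scales)]
    return x_s, w_s

def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

# Toy data: channel 1 carries outlier activations about 40x larger than channel 0.
x = [[0.5, 40.0], [-1.0, -32.0]]   # activations, shape (tokens, in_channels)
w = [[1.0, -2.0], [0.5, 0.25]]     # weights, shape (in_channels, out_channels)

scales = smoothing_scales(x, w)
x_s, w_s = apply_smoothing(x, w, scales)

y_ref = matmul(x, w)       # original layer output
y_s = matmul(x_s, w_s)     # output after smoothing (same up to rounding)

before = channel_absmax(x)
after = channel_absmax(x_s)
ratio_before = max(before) / min(before)   # inter-channel disparity before
ratio_after = max(after) / min(after)      # disparity after smoothing
```

    <p>In this toy case the layer output is unchanged up to floating-point rounding, while the ratio between the largest and smallest channel magnitudes shrinks, which is what makes a shared low-bit quantization grid less wasteful. Calling <code>smoothing_scales(x, w, skip={1})</code> leaves channel 1 untouched.</p>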

    <h2>FovVideoVDP Scores</h2>
    <div class="table-container">
        <table class="vdp-table">
            <thead>
                <tr>
                    <th>Method</th>
                    <th>Precision</th>
                    <th>Front</th>
                    <th>Left</th>
                    <th>Right</th>
                </tr>
            </thead>
            <tbody>
                <tr>
                    <td>Full Model</td>
                    <td rowspan="1">FP32</td>
                    <td>6.5364</td>
                    <td>5.9480</td>
                    <td>5.8625</td>
                </tr>
                <tr class="section-sep"><td colspan="5"></td></tr>
                <tr>
                    <td>Adaround+LSQ</td>
                    <td rowspan="8">W4A4</td>
                    <td>4.2531</td>
                    <td>3.6143</td>
                    <td>3.5606</td>
                </tr>
                <tr><td>POCA</td><td>5.2310</td><td>4.3838</td><td>4.3457</td></tr>
                <tr><td>2DQuant</td><td>5.2987</td><td>4.3948</td><td>4.3712</td></tr>
                <tr><td>GPTQ</td><td>5.4980</td><td>4.5868</td><td>4.5729</td></tr>
                <tr><td>ICAS (Ours)</td><td>5.5901</td><td>4.7317</td><td>4.7536</td></tr>
                <tr><td>UV-W (Ours)</td><td>5.7559</td><td>4.8130</td><td>4.8187</td></tr>
                <tr><td>ICAS-UV (Ours)</td><td>5.6438</td><td>4.9145</td><td>4.9057</td></tr>
                <tr><td>FFAS-UV (Ours)</td><td><strong>5.8541</strong></td><td><strong>4.9795</strong></td><td><strong>4.9605</strong></td></tr>
                <tr class="section-sep"><td colspan="5"></td></tr>
                <tr>
                    <td>Adaround+LSQ</td>
                    <td rowspan="8">W8A8</td>
                    <td>6.2106</td>
                    <td>5.5004</td>
                    <td>5.4381</td>
                </tr>
                <tr><td>POCA</td><td>6.4827</td><td>5.8511</td><td>5.7565</td></tr>
                <tr><td>2DQuant</td><td>6.4983</td><td>5.8313</td><td>5.7497</td></tr>
                <tr><td>GPTQ</td><td>6.2359</td><td>5.6188</td><td>5.3613</td></tr>
                <tr><td>ICAS (Ours)</td><td>5.6007</td><td>5.3913</td><td>5.0762</td></tr>
                <tr><td>UV-W (Ours)</td><td><strong>6.5271</strong></td><td><strong>5.9101</strong></td><td>5.7610</td></tr>
                <tr><td>ICAS-UV (Ours)</td><td>6.3690</td><td>5.6615</td><td>5.5998</td></tr>
                <tr><td>FFAS-UV (Ours)</td><td>6.5241</td><td>5.8589</td><td><strong>5.8071</strong></td></tr>
            </tbody>
        </table>
    </div>
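    <p>As a quick check of the "+0.39 over the best 4-bit baseline" figure quoted in the abstract, the W4A4 rows of the table can be compared directly. The sketch below copies the scores from the table; the variable names are illustrative, and treating Adaround+LSQ, POCA, 2DQuant, and GPTQ as the baselines is an assumption based on which rows are not marked "Ours".</p>

```python
# W4A4 scores copied from the table above, ordered (Front, Left, Right).
w4a4 = {
    "Adaround+LSQ": (4.2531, 3.6143, 3.5606),
    "POCA": (5.2310, 4.3838, 4.3457),
    "2DQuant": (5.2987, 4.3948, 4.3712),
    "GPTQ": (5.4980, 4.5868, 4.5729),
    "FFAS-UV": (5.8541, 4.9795, 4.9605),
}
views = ("Front", "Left", "Right")

# Assumption: every row not marked "Ours" counts as a baseline.
baselines = [name for name in w4a4 if name != "FFAS-UV"]

gains = {}
for i, view in enumerate(views):
    # Strongest baseline for this view, then the FFAS-UV margin over it.
    best = max(baselines, key=lambda name: w4a4[name][i])
    gains[view] = (best, round(w4a4["FFAS-UV"][i] - w4a4[best][i], 4))

largest_gain = max(gain for _, gain in gains.values())

for view in views:
    best, gain = gains[view]
    print(f"{view}: best baseline {best}, FFAS-UV gain {gain:+.4f}")
```

    <p>With these numbers GPTQ is the strongest baseline in all three views, and the largest margin (Left view) rounds to 0.39.</p>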

    <h2>Inference Latency</h2>
    <div class="table-container">
        <table>
            <thead>
                <tr>
                    <th>Model</th>
                    <th>Device</th>
                    <th>Latency (ms)</th>
                </tr>
            </thead>
            <tbody>
                <tr>
                    <td>Encoder (full)</td>
                    <td>Snapdragon XR2 Gen 2</td>
                    <td>13.80</td>
                </tr>
                <tr>
                    <td>Encoder (full)</td>
                    <td>NVIDIA Jetson Orin NX 16GB</td>
                    <td>9.96</td>
                </tr>
                <tr>
                    <td>Encoder (8-bit)</td>
                    <td>Snapdragon XR2 Gen 2</td>
                    <td>4.00</td>
                </tr>
                <tr>
                    <td><strong>Encoder (8-bit)</strong></td>
                    <td><strong>Our hardware accelerator</strong></td>
                    <td><strong>3.05</strong></td>
                </tr>
                <tr>
                    <td>Decoder (full)</td>
                    <td>NVIDIA Jetson Orin NX 16GB</td>
                    <td>50.35</td>
                </tr>
                <tr>
                    <td>Decoder (full)</td>
                    <td>Snapdragon XR2 Gen 2</td>
                    <td>25.80</td>
                </tr>
                <tr>
                    <td>Decoder (8-bit)</td>
                    <td>Snapdragon XR2 Gen 2</td>
                    <td>14.50</td>
                </tr>
                <tr>
                    <td>Decoder (8-bit)</td>
                    <td>Our hardware accelerator</td>
                    <td>12.51</td>
                </tr>
                <tr>
                    <td><strong>Decoder (4-bit)</strong></td>
                    <td><strong>Our hardware accelerator</strong></td>
                    <td><strong>3.13</strong></td>
                </tr>
            </tbody>
        </table>
    </div>
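    <p>To relate these latencies to the 100 FPS target, the per-frame time budget can be worked out from the table. The sketch below is a back-of-the-envelope check, not a measurement: it assumes the 8-bit encoder and 4-bit decoder run back-to-back on the same accelerator and ignores every other stage of the rendering pipeline. The constant and variable names are illustrative.</p>

```python
# Latencies copied from the table above (our hardware accelerator rows), in ms.
ENCODER_8BIT_MS = 3.05
DECODER_4BIT_MS = 3.13
TARGET_FPS = 100

# Time available per frame at the target rate: 1000 ms / 100 = 10 ms.
frame_budget_ms = 1000.0 / TARGET_FPS

# Assumption: encoder and decoder execute serially with no overlap
# and no additional pipeline stages are counted.
serial_ms = ENCODER_8BIT_MS + DECODER_4BIT_MS
headroom_ms = frame_budget_ms - serial_ms
fits_budget = frame_budget_ms >= serial_ms

print(f"budget {frame_budget_ms:.2f} ms, encoder+decoder {serial_ms:.2f} ms, "
      f"headroom {headroom_ms:.2f} ms, fits: {fits_budget}")
```

    <p>Under these assumptions the two stages take 6.18 ms together, leaving roughly 3.8 ms of the 10 ms frame budget for everything else.</p>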
</body>
</html>
