Microscope: Efficient Diffusion with Two-Stage Dynamics Compression for High-Quality Talking Head Generation

17 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Talking Head Generation, Auto-Encoder, Two-Stage Compression, Efficient Video Diffusion Model
TL;DR: We introduce a non-autoregressive framework for talking head generation that synthesizes high-quality 512×512 videos of 1600+ frames on a single 16GB GPU with fast inference.
Abstract: The talking head generation task synthesizes videos from a single portrait image and audio input, animating the portrait to deliver the speech content. Non-autoregressive (NAR) approaches to talking head generation have demonstrated impressive quality and generation speed by producing video frames in parallel, thereby overcoming the error accumulation inherent in frame-wise autoregressive (AR) methods. However, NAR methods see limited practical use due to prohibitive VRAM requirements, especially when generating long sequences ($\geq 1000$ frames) at high resolution ($512 \times 512$). This paper proposes a novel framework that enables high-quality, non-autoregressive talking head generation while significantly reducing computational resource demands for both training and inference. We improve efficiency through our Microscope Dynamics Compression Framework (MDCF), a two-stage pipeline achieving 768× compression of the pixel-level dynamics latent. Additionally, we introduce a two-phase cascade training strategy that stably optimizes the MDCF while effectively alleviating error accumulation across the multi-stage compression. Experimental results demonstrate that our framework can non-autoregressively generate talking head videos of 1600+ frames at $512 \times 512$ resolution on a 16GB GPU, with state-of-the-art quality and inference speed. Our approach represents a significant advance toward practical, resource-efficient talking head synthesis for real-world applications. The source code will be made publicly available to facilitate further research.
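To give a sense of scale for the abstract's 768× figure, here is a back-of-the-envelope sketch. It assumes, for illustration only, that the compression factor applies to the per-frame RGB pixel tensor; the abstract does not specify how the factor decomposes across the two stages or across spatial and temporal dimensions.

```python
# Rough arithmetic only; the per-frame interpretation of 768x is an assumption.
frame_values = 512 * 512 * 3           # raw pixel values in one 512x512 RGB frame
latent_values = frame_values // 768    # values per frame after 768x compression
print(latent_values)                   # 1024

sequence_latent = 1600 * latent_values  # latent size for a 1600-frame sequence
print(sequence_latent)                  # 1638400
```

Under this reading, a 1600-frame sequence is represented by roughly 1.6M latent values instead of ~1.26B raw pixel values, which is what makes parallel (non-autoregressive) denoising of the whole sequence feasible on a 16GB GPU.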
Supplementary Material: zip
Primary Area: generative models
Submission Number: 9326