Implicit Compositional Generative Network for Length-Variable Co-Speech Gesture Synthesis

Chenghao Xu, Jiexi Yan, Yanhua Yang, Cheng Deng

Published: 2024, Last Modified: 27 Jan 2026. IEEE Trans. Multim. 2024. License: CC BY-SA 4.0

Abstract: Co-speech gesture synthesis is a practical yet challenging task that aims to generate body motion sequences in line with speech audio. Most existing methods can only generate gesture sequences with a fixed number of frames, which falls short of the quality that virtual speech video requires in real-world applications. In this paper, we propose a novel Implicit Compositional Generative Network (ICGN) for length-variable co-speech gesture synthesis. ICGN learns and optimizes an implicit neural representation of a whole gesture sequence of arbitrary length using temporal embeddings. Moreover, to make the synthesized gestures more realistic and consistent, we generate the gesture sequence compositionally through a well-designed asymmetric two-stream network that effectively captures and exploits the rich correlations between speech audio and human body motions. In this way, coarse and fine-grained gestures are synthesized from the content-aware and emotion-aware audio components, respectively. Extensive experiments on four widely used benchmarks demonstrate that the proposed method renders realistic human gestures and achieves superior performance over several state-of-the-art methods.
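To make the length-variable idea concrete, the following is a minimal sketch of an implicit representation queried by time. It is not the authors' architecture. Everything in it is an assumption made for illustration: the class name ToyImplicitGestureField, the sinusoidal form of the temporal embedding, the tiny untrained MLPs, the fixed-size content and emotion codes, and the choice to combine the coarse and fine outputs by addition. The two streams here are also deliberately identical in shape, unlike the asymmetric design the abstract describes.

```python
import math
import random


def temporal_embedding(t, num_freqs=4):
    """Sinusoidal embedding of a normalized time t in [0, 1] (assumed form)."""
    emb = []
    for k in range(num_freqs):
        freq = (2.0 ** k) * math.pi
        emb.append(math.sin(freq * t))
        emb.append(math.cos(freq * t))
    return emb


def make_linear(in_dim, out_dim, rng):
    """Random, untrained weights standing in for learned parameters."""
    scale = 1.0 / math.sqrt(in_dim)
    weights = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
               for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias


def linear(layer, x):
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def mlp(layers, x):
    """Apply tanh after every layer except the last."""
    for layer in layers[:-1]:
        x = [math.tanh(v) for v in linear(layer, x)]
    return linear(layers[-1], x)


class ToyImplicitGestureField:
    """Hypothetical toy model: maps (time, audio code) to a pose vector."""

    def __init__(self, audio_dim=3, pose_dim=6, hidden=16, num_freqs=4, seed=0):
        rng = random.Random(seed)
        in_dim = 2 * num_freqs + audio_dim
        self.num_freqs = num_freqs
        # Two streams: one driven by a content code, one by an emotion code.
        self.coarse = [make_linear(in_dim, hidden, rng),
                       make_linear(hidden, pose_dim, rng)]
        self.fine = [make_linear(in_dim, hidden, rng),
                     make_linear(hidden, pose_dim, rng)]

    def pose_at(self, t, content_code, emotion_code):
        emb = temporal_embedding(t, self.num_freqs)
        coarse = mlp(self.coarse, emb + content_code)
        fine = mlp(self.fine, emb + emotion_code)
        # Assumed composition rule: fine detail added on top of coarse motion.
        return [c + f for c, f in zip(coarse, fine)]

    def synthesize(self, num_frames, content_code, emotion_code):
        """Sample the same continuous field at any number of frames."""
        if num_frames == 1:
            times = [0.0]
        else:
            times = [i / (num_frames - 1) for i in range(num_frames)]
        return [self.pose_at(t, content_code, emotion_code) for t in times]


field = ToyImplicitGestureField()
content = [0.1, 0.2, 0.3]    # placeholder content-aware audio code
emotion = [0.0, 0.5, -0.5]   # placeholder emotion-aware audio code
short_clip = field.synthesize(5, content, emotion)
long_clip = field.synthesize(12, content, emotion)
print(len(short_clip), len(long_clip))
```

Because the pose is a function of continuous time rather than of a frame index, the same parameters can be sampled at 5 frames or 12 frames without any change to the model, which is the property a fixed-length generator lacks.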