Benchmarking Speech-Driven Gesture Generation Models for Generalization to Unseen Voices and Noisy Environments

Published: 31 Jul 2024, Last Modified: 21 Aug 2024, GENEA Workshop 2024, CC BY 4.0
Keywords: gesture generation, robustness, voice conversion, noise
Abstract: Speech-driven gesture generation models enhance robot gestures and control avatars in virtual environments by synchronizing gestures with speech prosody. However, state-of-the-art models are trained on a limited number of speakers, with audio typically recorded under controlled conditions, which can result in poor generalization to new voices and noisy environments. This paper presents a method for evaluating the robustness of speech-driven gesture generation models to unseen voices and varying noise levels. We use a voice conversion model to produce synthetic speech that preserves the prosodic features of the original recordings, enabling a targeted test of the model's generalization to new speakers. Additionally, we introduce a controlled synthetic noisy dataset to evaluate model performance under different noise conditions. Together, these components form a comprehensive framework for robustness evaluation in speech-to-gesture synthesis benchmarks. Applying this approach to the state-of-the-art DiffuseStyleGesture+ model reveals a slight performance degradation with unseen voices and increasing background noise. Our findings underscore the need for models that generalize better to real-world conditions, ensuring reliable performance in varied acoustic scenarios.
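The abstract describes two audio perturbations: replacing the speaker's voice while keeping prosody, and adding background noise at controlled levels. As a minimal sketch of the noise perturbation, the snippet below mixes noise into a speech signal at a target signal-to-noise ratio; the function name `mix_at_snr` and the choice of SNR levels are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix background noise into speech at a target SNR in dB (hypothetical helper)."""
    # Loop the noise if it is shorter than the speech, then trim to length.
    if len(noise) < len(speech):
        reps = int(np.ceil(len(speech) / len(noise)))
        noise = np.tile(noise, reps)
    noise = noise[: len(speech)]

    # Scale the noise so that 10 * log10(P_speech / P_noise_scaled) == snr_db.
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Example: generate progressively noisier versions of one utterance.
# The SNR levels here are assumptions; the paper's actual conditions may differ.
# noisy_variants = [mix_at_snr(speech, noise, snr) for snr in (20, 10, 5, 0)]
```

For the voice perturbation, the conversion model itself is not specified on this page, but a simple sanity check that conversion preserved prosody is to compare F0 contours before and after. The sketch below does this with librosa's pYIN pitch tracker; `f0_correlation` and the F0 search range are again assumptions for illustration, not the paper's pipeline.

```python
import numpy as np
import librosa

def f0_correlation(original_wav: str, converted_wav: str, sr: int = 16000) -> float:
    """Pearson correlation of F0 contours before and after voice conversion."""
    contours = []
    for path in (original_wav, converted_wav):
        y, _ = librosa.load(path, sr=sr)
        # pYIN returns NaN for unvoiced frames; 65-400 Hz covers typical speech.
        f0, _, _ = librosa.pyin(y, fmin=65.0, fmax=400.0, sr=sr)
        contours.append(f0)

    # Truncate to the shorter contour (durations may differ slightly) and
    # compare only frames that are voiced in both signals.
    n = min(len(contours[0]), len(contours[1]))
    a, b = contours[0][:n], contours[1][:n]
    mask = ~(np.isnan(a) | np.isnan(b))
    return float(np.corrcoef(a[mask], b[mask])[0, 1])
```

A correlation near 1 would indicate that the converted speech keeps the original intonation pattern, which is the property the generalization test relies on.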
Submission Number: 8