FEIN-Z: Autoregressive Behavior Cloning for Speech-Driven Gesture Generation

Leon Harz; Hendric Voß; Stefan Kopp

Published: 04 Sept 2023, Last Modified: 30 Oct 2023

Venue: GENEA Challenge 2023 Mainproceeding

Readers: Everyone

Keywords: machine learning, deep learning, co-speech gesture generation, gesture synthesis, multimodal data, transformer, behavior cloning, reinforcement learning

Abstract: Human communication relies on multiple modalities such as verbal expressions, facial cues, and bodily gestures. Developing computational approaches to process and generate these multimodal signals is critical for seamless human-agent interaction. A particular challenge is the generation of co-speech gestures due to the large variability and number of gestures that can accompany a verbal utterance, leading to a one-to-many mapping problem. This paper presents an approach based on a Feature Extraction Infusion Network (FEIN-Z) that adopts insights from robot imitation learning and applies them to co-speech gesture generation. Building on the BC-Z architecture, our framework combines transformer architectures and Wasserstein generative adversarial networks. We describe the FEIN-Z methodology and evaluation results obtained within the GENEA Challenge 2023, demonstrating good results and significant improvements in human-likeness over the GENEA baseline. We discuss potential areas for improvement, such as refining input segmentation, employing more fine-grained control networks, and exploring alternative inference methods.
