From Embeddings to Language Models: A Comparative Analysis of Feature Extractors for Text-Only and Multimodal Gesture Generation
Abstract: Generating expressive and contextually appropriate co-speech gestures is crucial for naturalness in human-agent interaction. While Large Language Models (LLMs) have shown great potential for this task, questions remain regarding the optimal integration of multimodal features and the capabilities of smaller, more accessible models. This study presents a systematic and comparative evaluation of seven gesture generation pipelines, using a robust diffusion-based architecture as our foundation.
We investigate the impact of audio (WavLM, Whisper) and text (Word2Vec, Llama-3.2-3B-Instruct) feature extractors to assess the relative contribution of each modality to overall performance. We demonstrate that it is possible to achieve state-of-the-art performance using a significantly smaller LLM (3B parameters) than previous benchmarks, without sacrificing quality.
Our results, based on objective metrics and a comprehensive perceptual evaluation, reveal that pipelines incorporating Llama-3.2-3B-Instruct not only outperform the reference systems in semantic appropriateness and human-likeness, but are also perceived as more appropriate by human evaluators. This work offers guidance for feature and model selection in gesture synthesis, balancing generative quality with model accessibility.
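As an illustration only (not the authors' released code), the following minimal Python sketch shows how one might extract per-token text features from Llama-3.2-3B-Instruct and per-frame audio features from WavLM with Hugging Face Transformers, to serve as conditioning inputs for a diffusion-based gesture generator. The model identifiers, tensor shapes, and placeholder waveform are assumptions; the Llama weights are gated and require access approval.

    # Hypothetical sketch: feature extraction for gesture-generation conditioning.
    import torch
    from transformers import AutoTokenizer, AutoModel, AutoFeatureExtractor, WavLMModel

    # --- Text features from Llama-3.2-3B-Instruct (gated model; access required) ---
    tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")
    llm = AutoModel.from_pretrained("meta-llama/Llama-3.2-3B-Instruct",
                                    torch_dtype=torch.float16)
    llm.eval()

    transcript = "well I think that's a really good point"
    with torch.no_grad():
        text_inputs = tok(transcript, return_tensors="pt")
        # Last hidden layer: one feature vector per subword token (hidden size 3072).
        text_feats = llm(**text_inputs).last_hidden_state

    # --- Audio features from WavLM ---
    fe = AutoFeatureExtractor.from_pretrained("microsoft/wavlm-large")
    wavlm = WavLMModel.from_pretrained("microsoft/wavlm-large")
    wavlm.eval()

    waveform = torch.zeros(16000)  # placeholder: 1 s of silence at 16 kHz
    with torch.no_grad():
        audio_inputs = fe(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
        # One feature vector per ~20 ms frame (hidden size 1024 for wavlm-large).
        audio_feats = wavlm(**audio_inputs).last_hidden_state

    # Both sequences would then be resampled/aligned to the gesture frame rate
    # and passed to the diffusion model as conditioning.
    print(text_feats.shape, audio_feats.shape)

In practice, swapping the text branch for Word2Vec embeddings or the audio branch for Whisper encoder states follows the same pattern: produce a time-aligned feature sequence and feed it to the same generator, which is what enables the controlled comparison described in the abstract.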
Submission Number: 5