JWB-DH-V1: Benchmark for Joint Whole-Body Talking Avatar and Speech Generation Version I

Published: 07 Aug 2025, Last Modified: 14 Aug 2025
Venue: Gen4AVC Poster
License: CC BY 4.0
Keywords: joint-video-audio-generation, dataset, whole-body
Abstract: Recent advances in diffusion-based video generation have enabled photo-realistic short clips, but current methods still struggle to achieve multi-modal consistency when jointly generating whole-body motion and natural speech. Existing approaches lack comprehensive evaluation frameworks that assess both visual and audio quality, and benchmarks for region-specific performance analysis remain insufficient. To address these gaps, we introduce the Joint Whole-Body Talking Avatar and Speech Generation Version I (JWB-DH-V1) benchmark, comprising a large-scale multi-modal dataset with 10,000 unique identities across 2 million video samples, and an evaluation protocol for assessing joint audio-video generation of whole-body animatable avatars. Our evaluation of SOTA models reveals consistent performance disparities between face/hand-centric and whole-body generation, indicating essential areas for future research. The dataset and evaluation tools are publicly available at \href{https://github.com/silent-commit/WholeBodyBenchmark}{https://github.com/silent-commit/WholeBodyBenchmark}.
Submission Number: 1