High-Fidelity Human Motion Generation with Motion Quality Feedbacks

16 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: human motion generation; text-to-motion generation
Abstract: Text-to-motion generation aims to synthesize realistic human motions from natural language descriptions. Prevailing approaches typically condition generative models on embeddings from the pre-trained CLIP text encoder. However, a fundamental discrepancy exists: CLIP's embeddings are optimized for static visual semantics and fail to capture the dynamic nuances essential for motion, leading to suboptimal generation quality. To bridge this semantic gap, we propose AdaQF, a novel diffusion-based framework that enables autonomous and efficient adaptation of the CLIP text encoder through feedback-driven co-optimization. AdaQF introduces a quality feedback loop in which semantic consistency constraints between the generated motion, the conditioning text, and the ground-truth motion guide the fine-tuning of the CLIP encoder via low-rank adaptation. This process yields AdaCLIP, a motion-specialized text encoder that produces semantically rich, dynamics-aware embeddings. Our framework delivers advantages from three perspectives: it achieves state-of-the-art performance on standard benchmarks, with an FID of 0.039 and an R-Precision of 0.888 on HumanML3D; it converges dramatically faster (up to 8x); and the resulting AdaCLIP module demonstrates strong transferability, serving as a versatile drop-in replacement that improves various motion generation models, including VQ-VAE-based and latent-diffusion-based ones, thus offering a general and efficient solution for high-fidelity text-to-motion synthesis. The code will be released.
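To make the feedback-driven co-optimization concrete, below is a minimal PyTorch sketch of the two ingredients the abstract describes: a low-rank (LoRA) adapter wrapped around a frozen text-encoder layer, and a semantic-consistency feedback loss between the text embedding, the generated-motion embedding, and the ground-truth-motion embedding. All module and function names (LoRALinear, consistency_feedback_loss, the 512-d placeholder features) are illustrative assumptions, not the authors' released implementation.

```python
# Hedged sketch: LoRA-adapted text projection trained with a
# semantic-consistency feedback loss (all names hypothetical).
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # keep pretrained CLIP weights frozen
        self.lora_a = nn.Parameter(torch.randn(base.in_features, rank) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(rank, base.out_features))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.lora_a @ self.lora_b) * self.scale

def consistency_feedback_loss(text_emb, gen_motion_emb, gt_motion_emb):
    """Align text, generated-motion, and ground-truth-motion embeddings
    in cosine space (one plausible form of the consistency constraints)."""
    text_emb = F.normalize(text_emb, dim=-1)
    gen_motion_emb = F.normalize(gen_motion_emb, dim=-1)
    gt_motion_emb = F.normalize(gt_motion_emb, dim=-1)
    loss_text_gen = 1.0 - (text_emb * gen_motion_emb).sum(-1).mean()
    loss_text_gt = 1.0 - (text_emb * gt_motion_emb).sum(-1).mean()
    loss_gen_gt = 1.0 - (gen_motion_emb * gt_motion_emb).sum(-1).mean()
    return loss_text_gen + loss_text_gt + loss_gen_gt

# Usage sketch: only the low-rank parameters receive gradients from the
# feedback loss; motion embeddings here are random stand-ins for outputs
# of the diffusion model and a motion encoder.
text_proj = LoRALinear(nn.Linear(512, 512))
optimizer = torch.optim.AdamW(
    [p for p in text_proj.parameters() if p.requires_grad], lr=1e-4)
text_emb = text_proj(torch.randn(4, 512))      # placeholder CLIP text features
gen_motion_emb = torch.randn(4, 512)           # placeholder generated motion
gt_motion_emb = torch.randn(4, 512)            # placeholder ground truth
loss = consistency_feedback_loss(text_emb, gen_motion_emb, gt_motion_emb)
loss.backward()
optimizer.step()
```

In this reading, the generator and the LoRA-adapted encoder are co-optimized: the consistency loss acts as the quality feedback that specializes the text embeddings for motion while the diffusion model continues its usual denoising objective.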
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 6708