Keywords: Humanoid Robot, Motion Generation, Reinforcement Learning, Text-to-Motion.
TL;DR: This paper presents Humanoid-R0, a method that uses RL to fine-tune a language-conditioned motion generation model, enabling stable and physically plausible long-horizon task execution on real humanoid robots.
Abstract: Motion generation models aim to create diverse and realistic motions from inputs such as text or keyframes, with applications in gaming, animation, and 3D content creation. Recently, generated motions have emerged as a viable supplement to reinforcement learning and teleoperation for humanoid control, overcoming the limited diversity of simulated data and bypassing the need for expensive demonstration collection. However, these downstream applications impose stringent requirements on motion quality—specifically, smoothness, skeletal plausibility, and keyframe consistency. To ensure generated motions meet these criteria, we quantify these metrics and convert them into reward functions for RL fine-tuning. Specifically, we fine-tune the motion generation component of HumanoidVLA using GRPO, resulting in Humanoid-R0, a model whose outputs are well-suited for robot control. Our approach is rigorously validated through extensive metric evaluations, simulation rendering, and real-world deployment, demonstrating significant improvements in motion smoothness, plausibility, and consistency. Notably, Humanoid-R0 enables stable execution of challenging sequences of consecutive commands on a real G1 robot, showcasing its enhanced capability for long-horizon task completion.
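To make the reward construction concrete, here is a minimal sketch of how the three quality terms named in the abstract (smoothness, skeletal plausibility, keyframe consistency) could be scored for a generated motion clip and combined into a scalar reward for GRPO-style fine-tuning. The function name, the bone table, and the weights are illustrative assumptions, not the paper's actual reward definitions.

```python
import numpy as np

# Hypothetical bone table: (parent_joint, child_joint) index pairs of the skeleton.
BONES = [(0, 1), (1, 2), (2, 3)]  # placeholder topology for illustration only

def motion_quality_reward(motion, ref_bone_lengths, keyframes, keyframe_idx, dt=1.0 / 30):
    """Score one generated clip with the three quality terms from the abstract.

    motion:           (T, J, 3) joint positions over T frames
    ref_bone_lengths: (len(BONES),) bone lengths of the reference skeleton
    keyframes:        (K, J, 3) conditioning poses the clip should reproduce
    keyframe_idx:     (K,) frame indices at which each keyframe applies
    """
    # Smoothness: penalize jerk, the third finite difference of joint positions.
    jerk = np.diff(motion, n=3, axis=0) / dt**3
    r_smooth = -np.linalg.norm(jerk, axis=-1).mean()

    # Skeletal plausibility: per-frame bone lengths should match the reference skeleton.
    seg = motion[:, [c for _, c in BONES]] - motion[:, [p for p, _ in BONES]]
    bone_len = np.linalg.norm(seg, axis=-1)                # (T, len(BONES))
    r_skeleton = -np.abs(bone_len - ref_bone_lengths).mean()

    # Keyframe consistency: generated poses should hit the conditioning keyframes.
    err = motion[keyframe_idx] - keyframes                 # (K, J, 3)
    r_keyframe = -np.linalg.norm(err, axis=-1).mean()

    # Weighted sum used as the scalar reward for one rollout (weights are made up).
    return 1.0 * r_smooth + 1.0 * r_skeleton + 1.0 * r_keyframe
```

In a GRPO setup, such a reward would be evaluated on each sampled motion in a group of rollouts for the same text prompt, with group-relative advantages driving the policy update.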
Primary Area: generative models
Submission Number: 19545