Keywords: robot policy learning, offline reinforcement learning, whole-body control
Abstract: Scaling imitation learning to high-DoF whole-body robots is fundamentally limited by the \textbf{curse of dimensionality} and the prohibitive cost of collecting expert demonstrations. We argue that the core bottleneck is paradigmatic: real-world supervision for whole-body control is inherently imperfect, yet most methods assume expert data. To overcome this, we propose \textbf{HVD} (Hierarchical Value-Decomposed Offline Reinforcement Learning), a framework that learns effective policies directly from suboptimal, reward-labeled trajectories. HVD structures the value function along the robot’s kinematic hierarchy and over temporal chunks, enabling precise credit assignment in long-horizon, high-dimensional tasks. Built on a Transformer-based architecture, HVD supports \textit{multi-modal} and \textit{multi-task} learning, allowing flexible integration of diverse sensory inputs. To enable realistic training and evaluation, we further introduce \textbf{WB-50}, a 50-hour dataset of teleoperated and policy-rollout trajectories that is annotated with rewards and preserves natural imperfections, including partial successes, corrections, and failures. Experiments show that HVD significantly outperforms existing baselines in success rate across complex whole-body tasks. Our results suggest that effective policy learning for high-DoF systems can emerge not from perfect demonstrations, but from structured learning over realistic, imperfect data.
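The abstract only sketches how HVD decomposes the value function, so the following is a minimal illustrative sketch, in PyTorch, of a value estimate decomposed over kinematic groups and temporal action chunks. Everything here is an assumption for illustration rather than the paper's architecture or objective: the group partition `KINEMATIC_GROUPS`, the chunk length, the network sizes, the additive combination of per-group values, and the class name `DecomposedValue` are all hypothetical.

```python
import torch
import torch.nn as nn

# Hypothetical sketch: a chunk-level value function decomposed over kinematic
# groups. The partition below (29 DoF split across torso, arms, legs), the
# chunk length, and the additive combination are illustrative assumptions.
KINEMATIC_GROUPS = {
    "torso": 3, "left_arm": 7, "right_arm": 7, "left_leg": 6, "right_leg": 6,
}
CHUNK_LEN = 8  # assumed number of timesteps per temporal chunk


class DecomposedValue(nn.Module):
    """Sums per-group chunk values into a single chunk-level value estimate."""

    def __init__(self, obs_dim: int, hidden: int = 128):
        super().__init__()
        # One small value head per kinematic group, conditioned on the shared
        # observation and on that group's action chunk only.
        self.group_heads = nn.ModuleDict({
            name: nn.Sequential(
                nn.Linear(obs_dim + dof * CHUNK_LEN, hidden),
                nn.ReLU(),
                nn.Linear(hidden, 1),
            )
            for name, dof in KINEMATIC_GROUPS.items()
        })

    def forward(self, obs: torch.Tensor, chunk_actions: dict) -> torch.Tensor:
        # obs: (B, obs_dim); chunk_actions[name]: (B, CHUNK_LEN, dof)
        group_values = [
            head(torch.cat([obs, chunk_actions[name].flatten(1)], dim=-1))
            for name, head in self.group_heads.items()
        ]
        # Additive decomposition: the total chunk value is the sum of per-group
        # values, so a TD error on the total can be attributed to individual
        # kinematic groups during offline training.
        return torch.stack(group_values, dim=0).sum(dim=0)


if __name__ == "__main__":
    B, OBS_DIM = 4, 64
    value_fn = DecomposedValue(OBS_DIM)
    obs = torch.randn(B, OBS_DIM)
    actions = {n: torch.randn(B, CHUNK_LEN, d) for n, d in KINEMATIC_GROUPS.items()}
    print(value_fn(obs, actions).shape)  # torch.Size([4, 1])
```

The additive form is one common way to make such a decomposition useful for credit assignment, since each group head receives a gradient proportional to its own contribution to the chunk-level error; whether HVD combines group values additively or through another mechanism is not stated in the abstract.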
Supplementary Material: zip
Primary Area: applications to robotics, autonomy, planning
Submission Number: 18935