MetaWorld: Skill Transfer and Composition in a Hierarchical World Model for Grounding High-Level Instructions

Published: 12 May 2026, Last Modified: 12 May 2026 | 2nd ViSCALE @ CVPR 2026 Poster | CC BY 4.0
Keywords: Humanoid Control, Hierarchical World Model, Vision-Language Models (VLM), Skill Composition, Symbol Grounding
Abstract: Humanoid loco-manipulation requires coordinating whole-body joints for simultaneous locomotion and object interaction, yet it remains challenging due to the high-dimensional action space. Current methods face three primary bottlenecks: (1) low sample efficiency in reinforcement learning, (2) poor generalization of imitation learning on long-horizon tasks, and (3) physical inconsistency in VLM-based planning. To address these, we propose MetaWorld, a hierarchical world model that integrates semantic planning and physical control via expert policy transfer. Specifically, (i) we introduce a motion prior fusion mechanism that leverages a pre-trained expert library within a compact latent dynamics model to accelerate online adaptation; (ii) a hierarchical task decoupling strategy decomposes long-term logic from local execution, ensuring robust cross-scene transfer; and (iii) the VLM serves as a semantic interface that maps high-level instructions directly to pre-validated physical skills, bypassing the symbol grounding problem and ensuring dynamic feasibility. Evaluated on challenging tasks in HumanoidBench, MetaWorld significantly outperforms TD-MPC2 and DreamerV3 in both success rate and motion smoothness, achieving a 139.1% increase in average reward.
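The grounding idea in (iii) can be illustrated with a minimal, hypothetical sketch: the planner's output vocabulary is restricted to identifiers of pre-validated skills in the expert library, so any plan it emits is executable by construction. All names below (SkillLibrary, plan_with_vlm, the skill identifiers, the keyword lookup standing in for the VLM call) are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical expert library of pre-validated low-level skills.
# Each skill wraps a policy already verified to be dynamically
# feasible on the humanoid (names are illustrative).
@dataclass
class Skill:
    name: str
    policy: Callable[[dict], list]  # maps observation -> joint-level action

class SkillLibrary:
    def __init__(self, skills: List[Skill]):
        self._skills = {s.name: s for s in skills}

    def names(self) -> List[str]:
        return sorted(self._skills)

    def get(self, name: str) -> Skill:
        return self._skills[name]

def plan_with_vlm(instruction: str, library: SkillLibrary) -> List[str]:
    """Map a high-level instruction to a sequence of library skills.

    A real system would prompt a VLM with the instruction, the current
    image, and the list of valid skill names, then parse its answer.
    Here a keyword lookup stands in for the VLM so the sketch runs
    offline; the key point is that the output vocabulary is restricted
    to pre-validated skills, sidestepping symbol grounding.
    """
    proposed = []
    if "walk" in instruction or "go to" in instruction:
        proposed.append("walk_to_target")
    if "pick" in instruction or "grasp" in instruction:
        proposed.append("reach_and_grasp")
    if "place" in instruction or "put" in instruction:
        proposed.append("place_object")
    # Reject anything outside the library instead of executing it.
    return [name for name in proposed if name in library.names()]

if __name__ == "__main__":
    lib = SkillLibrary([
        Skill("walk_to_target", lambda obs: [0.0] * 19),
        Skill("reach_and_grasp", lambda obs: [0.0] * 19),
        Skill("place_object", lambda obs: [0.0] * 19),
    ])
    plan = plan_with_vlm("go to the table and pick up the box", lib)
    print(plan)  # ['walk_to_target', 'reach_and_grasp']
```

Constraining the interface to skill identifiers (rather than free-form actions) is what keeps the VLM's plans physically consistent: feasibility is guaranteed by the library, not by the language model.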
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 6