H-WM: Robotic Task and Motion Planning Guided by Hierarchical World Model
Keywords: Robot Task and Motion Planning, World Models, Robot Learning, Model Learning
TL;DR: We propose a hierarchical world model that jointly predicts symbolic and visual robot states, combining long-horizon planning robustness with visual grounding, and show its effectiveness in guiding VLA policies on long-horizon tasks.
Abstract: World models are becoming central to robotic planning and control because they enable prediction of future state transitions. Existing approaches often emphasize video generation or natural-language prediction, which are difficult to ground directly in robot actions and suffer from compounding errors over long horizons. Traditional task and motion planning relies on symbolic-logic world models, such as planning domains, that are robot-executable and robust for long-horizon reasoning but typically operate independently of visual perception, preventing synchronized prediction of symbolic and perceptual states.
We propose a Hierarchical World Model (H-WM) that jointly predicts logical and visual state transitions within a unified bilevel framework. H-WM combines a high-level logical world model with a low-level visual world model, integrating the robot-executability and long-horizon robustness of symbolic reasoning with the perceptual grounding of visual observations. The hierarchical outputs provide stable, consistent intermediate guidance for long-horizon tasks, mitigating error accumulation and enabling robust execution across extended task sequences. To train H-WM, we introduce a robotic dataset that aligns robot motion with symbolic states, actions, and visual observations. Experiments across vision–language–action (VLA) control policies demonstrate the effectiveness and generality of the approach.
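The abstract does not give implementation details, so the following is only a minimal Python sketch of the bilevel interface it describes: a high-level logical model rolling out symbolic transitions, a low-level visual model conditioned on each predicted symbolic state, and the pair yielded together as subgoal guidance. All names (LogicalWorldModel, VisualWorldModel, HierarchicalWorldModel) and the STRIPS-style effects table are hypothetical, and the learned visual predictor is replaced by a stub.

```python
import numpy as np
from dataclasses import dataclass

# Symbolic state: a set of grounded predicates, e.g. {"on(block_a, table)"}.
SymbolicState = frozenset


@dataclass
class LogicalWorldModel:
    """High-level model: applies STRIPS-style add/delete effects of a symbolic action."""
    effects: dict  # effects[action] = (add_set, delete_set); a toy domain for illustration

    def predict(self, state: SymbolicState, action: str) -> SymbolicState:
        add, delete = self.effects[action]
        return frozenset((state - delete) | add)


class VisualWorldModel:
    """Low-level model: predicts the next observation from the current image and the
    predicted symbolic state. A learned network in the paper; a stub here."""

    def predict(self, image: np.ndarray, sym_next: SymbolicState) -> np.ndarray:
        return image  # placeholder for a learned image/video predictor


class HierarchicalWorldModel:
    def __init__(self, logical: LogicalWorldModel, visual: VisualWorldModel):
        self.logical, self.visual = logical, visual

    def rollout(self, sym0: SymbolicState, img0: np.ndarray, plan: list):
        """Jointly roll out symbolic and visual predictions along a symbolic plan,
        yielding synchronized (action, symbolic subgoal, visual subgoal) triples."""
        sym, img = sym0, img0
        for action in plan:
            sym = self.logical.predict(sym, action)
            img = self.visual.predict(img, sym)
            yield action, sym, img


# Usage on a toy pick-and-place domain (hypothetical predicates and actions):
effects = {
    "pick(block_a)": (frozenset({"holding(block_a)"}),
                      frozenset({"on(block_a, table)", "handempty"})),
    "place(block_a)": (frozenset({"on(block_a, box)", "handempty"}),
                       frozenset({"holding(block_a)"})),
}
hwm = HierarchicalWorldModel(LogicalWorldModel(effects), VisualWorldModel())
sym0 = frozenset({"on(block_a, table)", "handempty"})
img0 = np.zeros((64, 64, 3), dtype=np.uint8)
for action, sym, img in hwm.rollout(sym0, img0, ["pick(block_a)", "place(block_a)"]):
    print(action, sorted(sym), img.shape)
```

Under these assumptions, the generator yields synchronized symbolic and visual subgoals at each plan step, which is one plausible reading of the "stable and consistent intermediate guidance" a downstream VLA policy would consume.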
Submission Number: 123