Lifting Ego World Models for Planning and Control

Published: 02 Mar 2026, Last Modified: 05 Mar 2026 · ICLR 2026 Workshop World Models · CC BY 4.0
Keywords: World models, planning, policy, egocentric, humanoid, diffusion, video generation, hierarchical, generative models
TL;DR: We train a goal-conditioned policy to generate actions in human joint space. We then combine the policy with a low-level world model, lifting its action space, and show improved effectiveness and efficiency in search-based planning.
Abstract: World models have shown remarkable ability to predict future observations from high-dimensional action inputs, but planning in complex action spaces like human joint movement remains a difficult and unsolved problem. Inspired by hierarchical control in humans, we design a goal-conditioned controller policy to generate low-level joint actions conditioned on high-level waypoint inputs. Leveraging waypoint goal-conditioning and short-term motion patterns, we combine our policy with a low-level PEVA world model, lifting its input to the high-level waypoint space. First, we show that waypoint goal-conditioning improves Mean Joint Error (MJE) for a human-like agent by $5.8\times$ while being easily controllable and generalizing to unseen actions. Next, we perform visuomotor planning with the lifted PEVA world model for hybrid navigation-interaction tasks on the Nymeria dataset, improving MJE by up to $4.7\times$, while being more efficient and generalizing to entirely unseen environments.
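The lifting idea in the abstract can be sketched as follows. This is a minimal illustrative toy, not the authors' implementation: the controller, world model, and random-shooting search below (`controller`, `world_model`, `lifted_step`, `plan`) are hypothetical stand-ins assuming a goal-conditioned policy expands each high-level waypoint into a short joint-action sequence that a low-level world model then rolls forward, so the planner only searches the compact waypoint space.

```python
import numpy as np

rng = np.random.default_rng(0)

def controller(state, waypoint, horizon=4):
    """Hypothetical goal-conditioned policy: expand one high-level
    waypoint into a short sequence of low-level joint-space actions
    (here, simple interpolation toward the waypoint)."""
    return [state + (waypoint - state) * (t + 1) / horizon for t in range(horizon)]

def world_model(state, joint_action):
    """Stand-in for the low-level world model: predicts the next
    state from a joint-space action."""
    return 0.9 * state + 0.1 * joint_action

def lifted_step(state, waypoint):
    """The 'lifted' world model: takes a waypoint as input, runs the
    controller + low-level model internally, returns the outcome."""
    for action in controller(state, waypoint):
        state = world_model(state, action)
    return state

def plan(state, goal, candidates=64, steps=3):
    """Random-shooting search over the compact waypoint space, instead
    of searching directly over high-dimensional joint actions."""
    for _ in range(steps):
        options = goal + rng.normal(scale=0.5, size=(candidates, state.shape[0]))
        # Pick the waypoint whose predicted outcome lands closest to the goal.
        best = min(options, key=lambda w: np.linalg.norm(lifted_step(state, w) - goal))
        state = lifted_step(state, best)
    return state

start, goal = np.zeros(3), np.ones(3)
final = plan(start, goal)
print(np.linalg.norm(final - goal), np.linalg.norm(start - goal))
```

The key design point mirrored here is that the search loop never touches joint-space actions directly; the controller amortizes them, which is what makes waypoint-space planning cheaper than joint-space planning.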
Submission Number: 60