Keywords: World Models, Video Diffusion Models, Zero-Shot Policy Learning, Embodied Intelligence
TL;DR: PEWM tackles embodied AI’s data bottleneck by generating short clips of primitive actions—enabling precise language-action alignment, better efficiency, and compositional control via VLM + goal guidance. A step toward “scalable robotic learning.”
Abstract: While video-generation-based embodied world models have gained increasing attention, their reliance on large-scale embodied interaction data remains a key bottleneck. The scarcity, collection difficulty, and high dimensionality of embodied data fundamentally limit the granularity of language-action alignment and exacerbate the challenge of long-horizon video generation, hindering generative models from reaching a "GPT moment" in the embodied domain. We start from a simple observation: the diversity of embodied data far exceeds the relatively small space of possible primitive motions. Based on this insight, we propose a novel paradigm for world modeling: Primitive Embodied World Models (PEWM). By restricting video generation to fixed short horizons, our approach 1) enables fine-grained alignment between linguistic concepts and visual representations of robotic actions, 2) reduces learning complexity, 3) improves data efficiency in embodied data collection, and 4) decreases inference latency. Equipped with a modular Vision-Language Model (VLM) planner and a Start-Goal heatmap Guidance mechanism (SGG), PEWM further enables flexible closed-loop control and supports compositional generalization of primitive-level policies over extended, complex tasks. Our framework leverages the spatiotemporal vision priors of video models and the semantic awareness of VLMs to bridge the gap between fine-grained physical interaction and high-level reasoning, paving the way toward scalable, interpretable, and general-purpose embodied intelligence.
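The closed-loop composition described in the abstract (VLM planner selects a primitive, SGG provides start/goal heatmaps, a short-horizon video model predicts a clip, and actions decoded from that clip are executed) can be pictured with the minimal sketch below. This is an illustrative assumption-laden sketch only, not the paper's implementation; all component names and method signatures (planner.next_primitive, sgg.heatmaps, world_model.generate, action_decoder) are hypothetical placeholders.

```python
# Minimal closed-loop sketch of a PEWM-style control cycle: a VLM planner
# chooses the next primitive, SGG supplies start/goal heatmaps, a short-horizon
# video model predicts a clip, and actions recovered from that clip are executed.
# All objects passed in are hypothetical stand-ins, not the paper's API.

def run_episode(task, env, planner, sgg, world_model, action_decoder,
                max_primitives=20):
    """Compose fixed short-horizon primitives into a long-horizon task."""
    obs = env.reset()
    for _ in range(max_primitives):
        # 1) VLM planner: decompose the task into the next primitive command.
        primitive = planner.next_primitive(task, obs)
        if primitive is None:          # planner signals task completion
            break

        # 2) Start-Goal heatmap Guidance (SGG): spatial cues for where the
        #    motion should begin and end in the current frame.
        start_hm, goal_hm = sgg.heatmaps(obs, primitive)

        # 3) Primitive world model: short-horizon clip conditioned on the
        #    language primitive and the heatmaps (the fixed horizon keeps
        #    generation tractable and low-latency).
        clip = world_model.generate(obs, primitive, start_hm, goal_hm)

        # 4) Decode actions from the predicted clip, execute them, and close
        #    the loop with a fresh observation.
        obs = env.step(action_decoder(clip))
    return obs
```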
Primary Area: applications to robotics, autonomy, planning
Submission Number: 9943