Keywords: World Model, VLA, Humanoid Robotics
TL;DR: We propose FLARE, a conceptually simple and lightweight framework for joint robot policy learning and latent world modeling.
Abstract: We introduce **F**uture **LA**tent **R**epresentation Alignm**E**nt (**FLARE**), a novel framework that integrates predictive world modeling into robot policy learning.
By aligning features from a diffusion transformer policy with latent embeddings of future observations, **FLARE** trains the policy to anticipate those future latents, allowing it to reason about long-term consequences while generating actions.
Remarkably lightweight, **FLARE** requires only a minimal architectural modification, adding a few tokens to standard vision-language-action (VLA) models, yet delivers substantial performance gains.
Across two challenging multitask simulation imitation learning benchmarks spanning single-arm and humanoid tabletop manipulation, **FLARE** achieves state-of-the-art performance, outperforming prior policy learning baselines by up to 26%.
Moreover, **FLARE** unlocks the ability to co-train with human egocentric video demonstrations lacking action labels, significantly boosting policy generalization to a novel object with unseen geometry using as few as one robot demonstration.
Our results establish **FLARE** as a general and scalable approach for combining implicit world modeling with high-frequency robotic control.
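As a rough illustration of the alignment idea described in the abstract, the sketch below combines an action loss with a term that pulls the features of a few extra policy tokens toward the latent embeddings of a future observation. The abstract does not specify the form of the alignment objective, so the cosine-similarity loss, the weighting scheme, and every name here (`alignment_loss`, `total_loss`, `weight`) are assumptions made for illustration, not the authors' implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Small epsilon guards against division by zero for all-zero vectors.
    return dot / (norm_u * norm_v + 1e-8)


def alignment_loss(predicted_tokens, future_embeddings):
    """Mean (1 - cosine similarity) over paired token features.

    predicted_tokens: features read out from the extra policy tokens
                      (hypothetical name).
    future_embeddings: latent embeddings of a future observation,
                       one vector per token (hypothetical name).
    """
    assert len(predicted_tokens) == len(future_embeddings)
    sims = [
        cosine_similarity(p, t)
        for p, t in zip(predicted_tokens, future_embeddings)
    ]
    return 1.0 - sum(sims) / len(sims)


def total_loss(action_loss, predicted_tokens, future_embeddings, weight=0.5):
    """Action loss plus a weighted alignment term (weight is an assumed value)."""
    return action_loss + weight * alignment_loss(predicted_tokens, future_embeddings)
```

In this sketch the alignment term needs only future observations, not action labels, which is consistent with the abstract's point about co-training on action-free human video; for such data one would presumably drop the action loss and keep only the alignment term.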
Supplementary Material: zip
Submission Number: 951