FLARE: Robot Learning with Implicit World Modeling

Published: 25 Jun 2025 (last modified: 25 Jun 2025) · Dex-RSS-25 · CC BY 4.0
Keywords: Latent World Model, VLA, Humanoid Robotics
TL;DR: We propose FLARE, a conceptually simple and lightweight framework for joint robot policy learning and latent world modeling.
Abstract: We introduce **F**uture **LA**tent **RE**presentation Alignment (**FLARE**), a novel framework that integrates predictive world modeling into robot policy learning. By aligning intermediate features of a diffusion transformer policy with latent embeddings of future observations, **FLARE** trains the policy to anticipate those future representations, allowing it to reason about long-term consequences while generating actions. Remarkably lightweight, **FLARE** requires only minimal architectural modifications---adding a few tokens to standard vision-language-action (VLA) models---yet delivers substantial performance gains. Across two challenging multitask imitation learning benchmarks in simulation, spanning single-arm and humanoid tabletop manipulation, **FLARE** achieves state-of-the-art performance, outperforming prior policy learning baselines by up to 26%. Moreover, **FLARE** unlocks co-training with human egocentric video demonstrations that lack action labels, significantly boosting policy generalization to a novel object with unseen geometry from as few as **1** robot demonstration. Our results establish **FLARE** as a general and scalable approach for combining implicit world modeling with high-frequency robotic control.
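The core mechanism described above---aligning policy features with latent embeddings of future observations---can be sketched with a simple objective. This is a minimal, hypothetical illustration using a negative cosine-similarity loss; the abstract does not specify the exact loss, feature extraction, or token layout, so all names and shapes here are assumptions.

```python
import numpy as np

def alignment_loss(policy_features: np.ndarray, future_latents: np.ndarray) -> float:
    """Negative mean cosine similarity between the policy's predicted
    future-token features and the latent embeddings of future observations.

    Lower is better; perfectly aligned (parallel) features reach -1.0.
    Shapes: (num_future_tokens, feature_dim) for both arguments.
    """
    # L2-normalize each feature vector before taking the dot product.
    p = policy_features / np.linalg.norm(policy_features, axis=-1, keepdims=True)
    z = future_latents / np.linalg.norm(future_latents, axis=-1, keepdims=True)
    return float(-np.mean(np.sum(p * z, axis=-1)))

# Sanity check: aligning features with themselves yields the minimum loss.
feats = np.random.randn(4, 8)  # e.g. 4 extra "future" tokens, 8-dim features
print(round(alignment_loss(feats, feats), 6))  # -> -1.0
```

In a full VLA training loop, such a term would be added to the usual action (diffusion) loss, which is how a few extra tokens can carry world-modeling supervision without changing the rest of the architecture.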
Submission Number: 5