Keywords: VLA, World Model, Test-Time Training
TL;DR: WorldAgen is a unified VLA framework that combines world modeling and action prediction with lightweight test-time training, enabling adaptation to new environments.
Abstract: How can vision-language-action (VLA) models adapt to new environments where world dynamics shift? While recent research has combined world modeling and action prediction to improve VLA performance, existing methods largely rely on pretraining on static datasets, without mechanisms for active adaptation to new environments. As a result, these models often fail to generalize when deployed in unseen scenarios with novel object configurations or dynamics. We present WorldAgen, a unified framework that jointly learns world modeling and action prediction while enabling test-time training (TTT) to adapt to new environments. WorldAgen employs a shared Transformer backbone with two heads: (1) a world-model head that predicts future states from past state-action trajectories, and (2) an agent-model head that predicts actions conditioned on task instructions. At test time, WorldAgen samples exploratory actions, collects ground-truth state transitions, and performs lightweight TTT updates to refine its world model. This adaptation improves the model's understanding of the environment and leads to more accurate action predictions. Experiments on the CALVIN and LIBERO benchmarks demonstrate that our baseline model achieves comparable, and in some cases superior, performance to current state-of-the-art approaches. Moreover, with TTT on a small number of samples, our method surpasses existing state-of-the-art models, highlighting the effectiveness of adapting world models at inference time.
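The abstract describes a shared backbone with a world-model head and an agent-model head, plus lightweight test-time updates on observed transitions. The sketch below is a minimal, hypothetical PyTorch illustration of that structure, not the authors' implementation: the module names, dimensions, flat state/action vectors, and the `ttt_update` helper are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class WorldAgenSketch(nn.Module):
    """Illustrative sketch (assumed, not the paper's code): a shared Transformer
    backbone with (1) a world-model head predicting the next state and
    (2) an agent-model head predicting an action from an instruction token."""

    def __init__(self, state_dim=64, action_dim=8, instr_dim=64, d_model=128):
        super().__init__()
        self.state_proj = nn.Linear(state_dim, d_model)
        self.action_proj = nn.Linear(action_dim, d_model)
        self.instr_proj = nn.Linear(instr_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.world_head = nn.Linear(d_model, state_dim)   # future-state prediction
        self.agent_head = nn.Linear(d_model, action_dim)  # action prediction

    def forward(self, states, actions, instr):
        # states: (B, T, state_dim), actions: (B, T, action_dim), instr: (B, instr_dim)
        tokens = torch.cat([
            self.instr_proj(instr).unsqueeze(1),            # instruction token
            self.state_proj(states) + self.action_proj(actions),  # trajectory tokens
        ], dim=1)
        h = self.backbone(tokens)
        next_state_pred = self.world_head(h[:, -1])  # from last trajectory token
        action_pred = self.agent_head(h[:, 0])       # from instruction token
        return next_state_pred, action_pred


def ttt_update(model, states, actions, instr, next_state, lr=1e-4, steps=5):
    """Lightweight test-time training: refine the world-model objective on
    transitions gathered via exploratory actions (hypothetical procedure)."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(steps):
        pred_next, _ = model(states, actions, instr)
        loss = nn.functional.mse_loss(pred_next, next_state)
        opt.zero_grad()
        loss.backward()
        opt.step()
```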
Submission Type: Demo Paper (4-9 Pages)
NeurIPS Resubmit Attestation: This submission is not a resubmission of a NeurIPS 2025 submission.
Submission Number: 154