Keywords: offline reinforcement learning, all-in-one world model, online planning, model predictive control, goal reaching
TL;DR: Use an offline pretrained masked Transformer, which doubles as both a policy and a world model, to do MPC planning at test time, boosting performance with zero extra training. Extends to online finetuning and goal-reaching.
Abstract: Recent work in Offline Reinforcement Learning (RL) has shown that an
all-in-one world model pretrained offline via a masked auto-encoding
objective can effectively capture the relationships between different
modalities (e.g., states, actions, rewards) within trajectory datasets.
However, this model's full potential has not been exploited during
deployment, where the agent must generate an optimal policy rather than
merely reconstruct masked tokens. Since the pretrained model subsumes
both a Policy Model and a World Model under appropriate mask patterns,
we propose leveraging it for \textit{online planning} via Model Predictive
Control (MPC) at test time, using the model's own predictive capability
to guide action selection. Empirical results on D4RL and RoboMimic show
that our online planning framework significantly improves the
decision-making performance of the pretrained model without any
additional parameter training. Furthermore, the framework extends
naturally to
Offline-to-Online (O2O) RL and Goal-Reaching RL, yielding more
substantial gains when an online interaction budget is available and
better generalization when diverse task targets are specified.
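For concreteness, here is a minimal sketch of the test-time MPC loop the abstract describes, assuming a hypothetical `model` object with `predict_action` (policy mode: the action token is masked and reconstructed) and `predict_next` (world-model mode: next state and reward are masked and reconstructed) methods. These names and the candidate-sampling scheme are illustrative assumptions, not the paper's actual interface:

```python
import numpy as np

def mpc_plan(model, state, horizon=5, n_candidates=64):
    """One MPC step: sample candidate action sequences with the model
    acting as a policy, evaluate them with the model acting as a world
    model, and return the first action of the best-scoring sequence.

    `model.predict_action` / `model.predict_next` are hypothetical
    wrappers around the masked Transformer's token reconstruction.
    """
    best_action, best_return = None, -np.inf
    for _ in range(n_candidates):
        s, total_return, actions = state, 0.0, []
        for _ in range(horizon):
            # Policy mode: mask the action token, let the model fill it in.
            a = model.predict_action(s)
            # World-model mode: mask next state/reward tokens, predict them.
            s, r = model.predict_next(s, a)
            total_return += r
            actions.append(a)
        if total_return > best_return:
            best_return, best_action = total_return, actions[0]
    # Execute only the first action of the best sequence, then re-plan.
    return best_action
```

Under this reading, no parameters are updated at deployment: the same pretrained network supplies both the action proposals and the rollout predictions that score them.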
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 21