Offline Reinforcement Learning (RL) aims to learn optimal policies from fixed datasets, without further interaction with the environment.
Such methods train an offline policy (or value function) and apply it at inference time without further refinement.
We introduce an inference-time adaptation framework inspired by model predictive control (MPC) that utilizes a pretrained policy along with a learned world model of state transitions and rewards.
While existing world model and diffusion-planning methods use learned dynamics to generate imagined trajectories during training, or to sample candidate plans at inference time, they do not use inference-time information to \emph{optimize} the policy parameters on the fly.
In contrast, our design is a \emph{Differentiable World Model} (DWM) pipeline that enables end-to-end gradient computation through imagined rollouts for \emph{policy optimization at inference time} based on MPC.
We evaluate our algorithm on D4RL continuous-control benchmarks (MuJoCo locomotion tasks and AntMaze), and show that exploiting inference-time information to optimize the policy parameters yields consistent gains over strong offline RL baselines.
\vspace{-1ex}

% At each decision step, given the current state, we generate multiple finite-horizon imagined trajectories, maximize a surrogate objective defined over predicted rewards (and pre-trained terminal value), and update the policy parameters via gradient steps before executing the resulting action. 
% We evaluate our algorithm on D4RL continuous-control benchmarks (MuJoCo locomotion tasks and AntMaze), demonstrating consistent gains over strong offline RL baselines.
% \rdcomment{Need to talk about experiments}
