Offline RL learns a policy in a Markov decision process from a fixed dataset of previously collected trajectories, without any additional interaction with the environment. 
This setting arises naturally in domains where data is abundant but online interaction is constrained.
Examples include modern recommender and advertising systems, which operate on massive logs of user interactions; healthcare and other safety-critical decision problems, which cannot permit exploratory actions; robotics and control, which may be limited by real-world cost, wear, and safety; and autonomous driving and navigation, which typically rely on recorded datasets rather than unrestricted on-policy exploration.

The core challenge of offline RL is to optimize reward while avoiding actions that are unsupported by the data distribution \cite{lange2012batch,levine2020offline}. Offline RL algorithms address this challenge by shaping either the policy update or the value-learning objective: behavior-regularized actor-critic (BRAC) approaches keep the learned policy close to the dataset behavior \cite{fujimoto2021td3bc,tarasov2023rebrac}; conservative Q-learning (CQL) penalizes high Q-values on unlikely actions to reduce overestimation under distribution shift \cite{kumar2020cql}; and in-sample or implicit methods, such as IQL, avoid querying out-of-distribution actions during improvement \cite{kostrikov2022iql}. 
Separately, motivated by the success of generative models, many recent works replace simple parametric policies with generative models that capture multi-modal behavior in the dataset. These include sequence-modeling policies that predict actions conditioned on return (Decision Transformer) \cite{chen2021decisiontransformer} and diffusion or flow-based planners and policies that generate actions or trajectories by iterative denoising or flow dynamics (e.g., Diffuser and recent flow-matching policy work) \cite{janner2022diffuser,lipman2023flowmatching}. For a detailed discussion of related work, see Appendix~\ref{sec:related}.
\begin{figure*}[t]
    \centering
    \includegraphics[width=\linewidth]{figures/MPCwDWM_block.pdf}
    \caption{Inference-time MPC with a diffusion world model. An offline dataset is used to train (i) a policy $\pi_{\psi}$ and terminal critic $Q_{\phi}$, (ii) a reward model $r_{\xi}$, and (iii) a diffusion-based dynamics sampler $f_{\theta}$. At inference time, starting from the current state $s_t$, we unroll multiple imagined rollouts by alternating policy actions $\tilde{a}_h=\pi_{\psi}(\tilde{s}_h)$ and diffusion transitions $\tilde{s}_{h+1}=f_{\theta}(\tilde{s}_h,\tilde{a}_h,\varepsilon_{t+h})$, evaluate a finite-horizon surrogate return (predicted rewards plus terminal value), and backpropagate through the differentiable rollout to update $\psi$ before executing the first action in the real environment. Green arrows indicate the forward rollout; red dashed arrows indicate gradient flow.}
\label{fig:mpc_dwm_flow}
\end{figure*}

Existing offline RL methods typically produce a single policy trained on the offline dataset and then deploy it ``as-is'' at inference time. These methods do not explicitly exploit inference-time information about the particular states encountered by the agent beyond the policy's standard state input. 
Specifically, an offline learner returns a policy and an associated $Q$-value, intended to approximate the optimal policy and $Q$-values (see Section~\ref{sec:prelim} for definitions). When the policy is deployed without further adaptation, performance depends primarily on how accurate these approximations are in the states and actions \emph{encountered at inference time}. We argue that achieving this accuracy is challenging for two reasons. First, estimating the $Q$-value is intrinsically difficult because it is a long-horizon quantity: it aggregates future rewards over many steps starting from a given state-action pair. Second, in offline RL the critic must approximate the $Q$-value of the \emph{optimal} policy, even though the available data may have been generated by a different behavior policy.
% ; this mismatch forces value learning to extrapolate beyond the dataset's action distribution. 


In the Markovian model, the one-step transition dynamics $P(s' \mid s,a)$ and reward function $r(s,a)$ are local objects: they depend only on the current state-action pair, rather than on long-horizon estimates, and are independent of whatever policy is being deployed. This fact motivates a different approach: learn world models of $P$ and $r$ from offline data and use these learned functions at inference time to formulate a policy. The link between $P$ and $r$ and a corresponding optimal policy is provided by Model Predictive Control (MPC) \citep{CamachoBordons2007MPC,RawlingsMayneDiehl2017MPC}. 
Conventional MPC optimizes action sequences over rolled-out short-horizon trajectories defined by the function $P$, thus implicitly generating a policy.
We use MPC differently: rather than optimizing an action sequence directly, we use information gathered from the rollout simulations to update the parameters that define our policy.
In this way, the initial policy learned from the offline data is updated continuously during deployment as new information is gathered about the system. We now summarize our method and outline our contributions.

%\begin{enumerate}[left=-0.4em]
\begin{squishenumerate}
    \item We introduce a \emph{Differentiable World Model} (DWM) pipeline that consists of (i) a state-transition sampler, (ii) a reward model, and (iii) a terminal-value function for finite-horizon evaluation. 
    A key design choice is to make these components \emph{differentiable} w.r.t.~their inputs or conditionings, so the entire pipeline forms a computation graph that supports gradient-based optimization. See Section~\ref{sec:worldModel} for a detailed description.

    \item At inference time, we use the current state $s_t$ to generate multiple imagined rollouts by unrolling the differentiable dynamics under the current policy. We score these rollouts with a surrogate objective built from predicted rewards and a terminal-value function. We then update the policy parameters with gradient-based steps and execute the action chosen by the updated policy for one step in the real environment. See Figure~\ref{fig:mpc_dwm_flow} for a visual overview and Section~\ref{sec:MPCwDWM} for a detailed description of our method.

    \item We instantiate the differentiable state-transition sampler with a diffusion model and derive a policy update that backpropagates through imagined rollouts, expressing the gradient in terms of the policy Jacobians and the diffusion-dynamics Jacobians via a recursive chain rule (see Theorem~\ref{thm:grad_recursion_diffusion_mpc}).

    \item We evaluate our algorithm on standard D4RL continuous-control benchmarks \citep{fu2020d4rl}, including MuJoCo locomotion tasks (18 datasets) as well as the more challenging AntMaze environments (6 datasets). We show that our approach of inference-time optimization over imagined rollouts consistently outperforms strong offline RL baselines (see Section~\ref{sec:exp}).
\end{squishenumerate}
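
As a rough sketch of the inference-time objective described above (the precise formulation appears in Section~\ref{sec:MPCwDWM}), suppose a rollout horizon $H$, a discount factor $\gamma$, $N$ sampled noise sequences $\varepsilon^{(i)}$, and a step size $\eta$. These symbols, and the choice to evaluate the terminal critic at the policy's own action, are illustrative assumptions rather than definitions taken from the method. With $\tilde{s}^{(i)}_0 = s_t$, $\tilde{a}^{(i)}_h = \pi_{\psi}(\tilde{s}^{(i)}_h)$, and $\tilde{s}^{(i)}_{h+1} = f_{\theta}(\tilde{s}^{(i)}_h, \tilde{a}^{(i)}_h, \varepsilon^{(i)}_{t+h})$, a surrogate return and policy update of this kind could be written as
\[
\widehat{J}(\psi; s_t) = \frac{1}{N}\sum_{i=1}^{N}\left[\,\sum_{h=0}^{H-1}\gamma^{h}\, r_{\xi}\bigl(\tilde{s}^{(i)}_h,\tilde{a}^{(i)}_h\bigr) + \gamma^{H}\, Q_{\phi}\bigl(\tilde{s}^{(i)}_H, \pi_{\psi}(\tilde{s}^{(i)}_H)\bigr)\right],
\qquad
\psi \leftarrow \psi + \eta\,\nabla_{\psi}\widehat{J}(\psi; s_t).
\]
Holding the noise fixed and dropping the rollout index, the chain rule propagates the dependence of each imagined state on $\psi$ through the rollout:
\[
\frac{d\tilde{s}_{h+1}}{d\psi} = \left(\frac{\partial f_{\theta}}{\partial s} + \frac{\partial f_{\theta}}{\partial a}\,\frac{\partial \pi_{\psi}}{\partial s}\right)\frac{d\tilde{s}_{h}}{d\psi} + \frac{\partial f_{\theta}}{\partial a}\,\frac{\partial \pi_{\psi}}{\partial \psi},
\qquad
\frac{d\tilde{s}_{0}}{d\psi} = 0,
\]
where the partial derivatives of $f_{\theta}$ are evaluated at $(\tilde{s}_h,\tilde{a}_h,\varepsilon_{t+h})$ and those of $\pi_{\psi}$ at $\tilde{s}_h$. This is only the generic form of such a recursion; Theorem~\ref{thm:grad_recursion_diffusion_mpc} gives the exact statement for the diffusion-based sampler.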
%\end{enumerate}

% \swcomment{I'm not totally sure what is going on. The last point above suggests that the next action is *not* chosen in the usual MPC style, by selecting the first action from the optimal action sequence computed from the short-horizon rollout. Rather, the simulated rollout information is used only to update the parameters that define the policy - THEN this policy is deployed to choose the action. Is that right?}

% \rdcomment{Yes that is correct. We use this updated policy to take the current action at time t, and at the next time step again perform the same procedure. This is different from the *usual* MPC approach and maybe we should specify this in our description. I will try and do that.}

% \swcomment{OK  I just changed my suggested para earlier in the intro.}

% \abcomment{it will be interesting to study (later) is there is any empirical difference between what we are doing vs the usual MPC approach.  }

\textbf{Comparison with existing world-model-based offline RL methods.} Although world models (including diffusion-based world models) have been used in \emph{offline} RL, they are typically leveraged in one of two ways. Some methods \citep{kidambi2020morel,yu2020mopo,yu2021combo,ding2024dwm} use the learned dynamics primarily during \emph{training}, generating imagined rollouts from the offline dataset to construct additional targets or synthetic experience for policy and value learning; this group includes diffusion world models that model multi-step futures without step-by-step rollout. Other methods use a generative model at \emph{inference time} to produce candidates for future optimal state trajectories via sampling (often guided by return-conditioning), and then select an action by executing the first action of a sampled plan, without adapting the underlying policy parameters during inference \citep{ajay2023conditionalgen,janner2022diffuser,ki2025priorguided,yun2024gtg}. In contrast, our method uses the current observed state to \emph{adapt the policy parameters at inference time} by backpropagating through a differentiable world model over imagined finite-horizon rollouts.

% In contrast, in this work, we couple a pre-trained offline policy with an explicit model of the environment in order to \emph{adapt the policy at inference time}. Concretely, we learn (i) a state-transition world model, (ii) a reward model, and (iii) any auxiliary terminal-value component needed for finite-horizon evaluation, and we design these components to be end-to-end \emph{differentiable} so they can be used as a computation graph for optimization. \abcomment{next sentence is important, but can be hard to parse for a reader ... lets discuss, maybe rephrase or break into smaller sentences ... also see comments at end of para} At inference time, given the current observed state $s_t$, we sample multiple imagined rollouts over a finite horizon by unrolling the differentiable dynamics under the current policy, evaluate a surrogate objective constructed from predicted rewards (and optional terminal value), and then take gradient-based update steps on the policy parameters to maximize this objective before executing the resulting action in the real environment. This procedure leverages the current state to perform local, on-the-fly improvement while remaining grounded in the offline-pretrained policy and the learned differentiable world model.

% \abcomment{the above para can be done a bit differently, by bringing in the MPC intuition first, discussing how that can handle bad value functions, and what is needed from the generative world model to make this work}




% \rdcomment{Need to talk about why we think using a world model at inference time would help improve the performance. - diminishing the reliance of $Q_{\phi}$. Models for $r$ and $P$ are not for the optimal policy $\pi^*$}

% \abcomment{such an approach will work better if the world models are ``better'' than the value functions learned from logged data}