

% \abcomment{can start by discussing what is a differential generative world model (DiffGenWM) ... we need samples and gradients wrt input. Further, we should explain any DiffGenWW would do, we focus on basic diffusion models etc.}

At inference time, our algorithm requires drawing next-state samples conditioned on the current state-action pair and differentiating those samples with respect to the conditioning variables. We refer to any model that exposes this sample-plus-gradient interface as a differentiable generative world model (DiffGenWM); in this work we implement it with a conditional diffusion model. In addition to the DiffGenWM module, our world model includes a differentiable reward model and a terminal value function given by the critic of a pretrained policy. Concretely, we build the world model from an offline dataset of transitions $\mathcal{D} = \{(s_t,a_t,r_t,s_{t+1})\}_{t=1}^N$; it consists of the following components.
\vspace{1.5ex}
\begin{squishenumerate}
    \item A \emph{Differentiable Diffusion Sampler} $f_{\theta}$ that simulates the transition dynamics.
    \item A reward model $r_{\xi}$ that predicts the reward of a given state-action pair.
    \item A pretrained policy $\pi_{\psi}$ and its corresponding pretrained critic $Q_{\phi}$.
\end{squishenumerate}
We describe each of these components in detail below.

\subsection{Differentiable Diffusion Sampler}
Our objective is to learn a parametric, differentiable diffusion sampler that, conditioned on a state-action pair $(s_t, a_t)$ at time $t$ and sampled noise $\varepsilon_t$, generates the next state $s_{t+1}$:
\begin{align*}
s_{t+1} = f_{\theta}(s_t, a_t, \varepsilon_t),
\quad \varepsilon_t \sim p_0(\varepsilon).
\end{align*}
Here $f_{\theta}$ is the \emph{reverse diffusion sampler}, written as a deterministic computation graph, and $\varepsilon_t$ is a collection of Gaussian random variables used in the generation process. Equivalently, the diffusion model specifies a conditional distribution over next states, $p_{\theta}(s_{t+1}\mid s_t,a_t)$, together with a reparameterized sampling procedure $s_{t+1}=f_{\theta}(s_t,a_t,\varepsilon_t)$ whose randomness is isolated in $\varepsilon_t$.

% \abcomment{quite a bit of redundancy here, given material in Section 2}

We learn $p_{\theta}(s_{t+1}\mid s_t,a_t)$ from the offline dataset of transitions $\mathcal{D}$.
The diffusion model is trained to represent the conditional law of $s_{t+1}$ given $(s_t,a_t)$ by introducing a \emph{forward noising process} on $s_{t+1}$ and a learned \emph{reverse denoising process} that inverts this noising when conditioned on $(s_t,a_t)$. At sampling time, the reverse process induces the map $f_{\theta}$.

\subsubsection{Forward process}

Fix a diffusion horizon $K\in\mathbb{N}$ and a noise schedule $\{\alpha_k\}_{k=1}^K \subset (0,1)$, where $1-\alpha_k$ is the variance of the noise injected at step $k$. For each transition tuple $(s_t,a_t,s_{t+1})\in\mathcal{D}$, define a Markovian forward process that progressively corrupts the next state as follows:
\begin{align*}
s_{t+1}^{(0)} &= s_{t+1}, \\
s_{t+1}^{(k)} \mid s_{t+1}^{(k-1)}
&\sim q\big(s_{t+1}^{(k)} \mid s_{t+1}^{(k-1)}\big) \\
&= \mathcal{N}\big(\sqrt{\alpha_k}\, s_{t+1}^{(k-1)}, (1-\alpha_k) I\big),
\qquad k=1,\dots,K.
\end{align*}
Let $\bar\alpha_k := \prod_{j=1}^k \alpha_j$. This choice implies a closed-form marginal for any noise level $k$:
\begin{align*}
q\big(s_{t+1}^{(k)} \mid s_{t+1}\big)
&=
\mathcal{N}\big(\sqrt{\bar\alpha_k}\, s_{t+1}, (1-\bar\alpha_k)I\big).
\end{align*}
In particular, one can write a reparameterized sample from the marginal as
\begin{align*}
s_{t+1}^{(k)} = \sqrt{\bar\alpha_k}\, s_{t+1} + \sqrt{1-\bar\alpha_k}\,\epsilon,
\qquad \epsilon \sim \mathcal{N}(0,I).
\end{align*}
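As a quick check of this closed form, consider the two-step case $k=2$ with independent $\epsilon_1,\epsilon_2\sim\mathcal{N}(0,I)$ (symbols introduced only for this illustration). Unrolling the forward process gives
\begin{align*}
s_{t+1}^{(2)}
&= \sqrt{\alpha_2}\,\big(\sqrt{\alpha_1}\, s_{t+1} + \sqrt{1-\alpha_1}\,\epsilon_1\big) + \sqrt{1-\alpha_2}\,\epsilon_2 \\
&= \sqrt{\alpha_1\alpha_2}\, s_{t+1} + \sqrt{\alpha_2(1-\alpha_1)}\,\epsilon_1 + \sqrt{1-\alpha_2}\,\epsilon_2 .
\end{align*}
The two independent Gaussian terms combine into a single Gaussian with covariance $\big(\alpha_2(1-\alpha_1) + (1-\alpha_2)\big) I = (1-\alpha_1\alpha_2) I = (1-\bar\alpha_2) I$, which matches the stated marginal. The general case follows by induction on $k$.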

\subsubsection{Conditional reverse process}

The reverse process aims to sample $s_{t+1}$ conditioned on $(s_t,a_t)$ by iteratively denoising from a Gaussian reference distribution at level $K$. Let the conditioning be $c_t := (s_t,a_t)$. At sampling time, we initialize from a base noise variable $s_{t+1}^{(K)} = z_K \sim \mathcal{N}(0,I)$ and apply a learned reverse transition for $k=K,K-1,\dots,1$:
\begin{align*}
s_{t+1}^{(k-1)}
&= \frac{1}{\sqrt{\alpha_k}}
  \Big(
    s_{t+1}^{(k)}
    - \frac{1-\alpha_k}{\sqrt{1-\bar\alpha_k}}\, \hat\epsilon_{\theta}\big(s_{t+1}^{(k)}, k, c_t\big)
  \Big)
  + \sigma_k z_{k-1},
\\ 
z_{k-1} &\sim \mathcal{N}(0,I).
\end{align*}
Here $\hat\epsilon_{\theta}(\cdot,k,c_t)$ is a parametric predictor of the standard Gaussian noise component at level $k$, the factor $(1-\alpha_k)/\sqrt{1-\bar\alpha_k}$ is the usual posterior-mean coefficient for a noise-prediction denoiser, and $\{\sigma_k\}$ specifies the reverse-process variance. The final denoised sample is $s_{t+1}^{(0)} \sim p_{\theta}(\cdot \mid s_t,a_t)$. We define the reverse sampler $f_{\theta}$ by collecting all Gaussian random variables used by the reverse procedure into $\varepsilon_t := (z_K,z_{K-1},\dots,z_0)$, so that the sampled next state can be written as $s_{t+1}^{(0)} = f_{\theta}(s_t,a_t,\varepsilon_t)$.


We define the initialization map $g_K:\mathbb{R}^d\to\mathbb{R}^d$ and the reverse-step map
$h_k:\mathbb{R}^d\times\mathbb{R}^d\times\mathbb{R}^m\times\mathbb{R}^d\to\mathbb{R}^d$ by
\begin{align}
g_K(z) &:= z,\nonumber\\
h_k(u,s,a,z)
&:=
\frac{1}{\sqrt{\alpha_k}}
\Big(
u-\frac{1-\alpha_k}{\sqrt{1-\bar\alpha_k}}\,\hat\epsilon_{\theta}\big(u,k,(s,a)\big)
\Big)\nonumber
\\&\qquad +\sigma_k z,\;\;\;\;\;
\text{for } k=1,\ldots,K,
\label{eq:g_h_def}
\end{align}
so that $s_{t+1}^{(k-1)}\!\!=\!h_k(s_{t+1}^{(k)},s_t,a_t,z_{k-1})$ and $s_{t+1}^{(K)}=g_K(z_K)$.

\paragraph{Deterministic computation graph for fixed noise.}
For any fixed realization of $\varepsilon_t$, the mapping $(s_t,a_t)\mapsto s_{t+1}^{(0)}$ is a deterministic composition of (i) linear operations, (ii) evaluations of the denoiser $\hat\epsilon_{\theta}(\cdot,k,c_t)$ at each reverse step, and (iii) additive terms determined by the fixed Gaussian draws. Consequently, provided the denoiser is differentiable in its noisy-state and conditioning inputs, $f_{\theta}$ is a differentiable computation graph in its inputs $(s_t,a_t)$ for fixed $\varepsilon_t$.
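
To make the gradient interface concrete, the following sketch unrolls the chain rule through the reverse steps for a fixed noise realization. The Jacobian symbols $J_k$ are notation introduced here for illustration, and the sketch assumes that $\hat\epsilon_{\theta}$ is differentiable in both its noisy-state and conditioning arguments. Writing $J_k := \partial s_{t+1}^{(k)} / \partial (s_t,a_t)$, we have
\begin{align*}
J_K &= 0, \\
J_{k-1} &= \frac{\partial h_k}{\partial u}\, J_k + \frac{\partial h_k}{\partial (s,a)} \quad \text{for each reverse step } k, \\
\frac{\partial f_{\theta}}{\partial (s_t,a_t)} &= J_0 ,
\end{align*}
where the partial derivatives of $h_k$ are evaluated at $\big(s_{t+1}^{(k)}, s_t, a_t, z_{k-1}\big)$. The base case $J_K=0$ holds because $s_{t+1}^{(K)}=g_K(z_K)$ does not depend on $(s_t,a_t)$. Automatic differentiation through the unrolled sampler evaluates the same chain-rule product, in reverse mode as a sequence of vector-Jacobian products.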

\subsubsection{Learning objective}

We train $\hat\epsilon_{\theta}$ to invert the forward corruption of $s_{t+1}$, conditioned on $(s_t,a_t)$. Using the marginal reparameterization at a randomly chosen diffusion level $k$, we form
\begin{align*}
s_{t+1}^{(k)} = \sqrt{\bar\alpha_k}\, s_{t+1} + \sqrt{1-\bar\alpha_k}\,\epsilon,
\qquad \epsilon \sim \mathcal{N}(0,I),
\end{align*}
and minimize the conditional noise-prediction error
\begin{align*}
\mathcal{L}(\theta)
&=
\mathbb{E}_{\substack{(s_t,a_t,s_{t+1})\sim \mathcal{D}\\ k \sim \mathrm{Unif}(\{1,\dots,K\})\\ \epsilon \sim \mathcal{N}(0,I)}}
\Big[
\big\|
\epsilon - \hat\epsilon_{\theta}\big(s_{t+1}^{(k)}, k, c_t\big)
\big\|_2^2
\Big].
\end{align*}
Intuitively, this objective teaches the denoiser to recover the injected Gaussian noise at arbitrary noise levels while leveraging $(s_t,a_t)$ as side information. After training, the resulting reverse process defines a conditional generative model $p_{\theta}(s_{t+1}\mid s_t,a_t)$, and its sampling procedure is precisely the transition map
$s_{t+1} = f_{\theta}(s_t,a_t,\varepsilon_t)$ with $\varepsilon_t \sim p_0(\varepsilon)$,
which we treat as a learned, differentiable simulator of one-step dynamics.
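
In practice, the expectation in $\mathcal{L}(\theta)$ is typically approximated by a minibatch average. As a sketch, assume a batch of $B$ transitions $(s_i,a_i,s'_i)$ drawn from $\mathcal{D}$, with independently sampled levels $k_i\sim\mathrm{Unif}(\{1,\dots,K\})$ and noises $\epsilon_i\sim\mathcal{N}(0,I)$; the batch size $B$ and the index $i$ are notation introduced here for illustration. One stochastic estimate of the objective is then
\begin{align*}
\widehat{\mathcal{L}}(\theta)
=
\frac{1}{B}\sum_{i=1}^{B}
\Big\|
\epsilon_i - \hat\epsilon_{\theta}\big(\sqrt{\bar\alpha_{k_i}}\, s'_i + \sqrt{1-\bar\alpha_{k_i}}\,\epsilon_i,\; k_i,\; (s_i,a_i)\big)
\Big\|_2^2 ,
\end{align*}
which is an unbiased estimate of $\mathcal{L}(\theta)$ and can be minimized by stochastic gradient descent on $\theta$.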


\subsection{Reward model}

In addition to the transition model, we learn a parametric reward predictor from the same offline data. We parameterize it as a function
$r_{\xi} : \mathcal{S}\times\mathcal{A}\to\mathbb{R}$,
which maps a state--action pair $(s_t,a_t)$ to a scalar reward estimate. We fit $r_{\xi}$ on $\mathcal{D}$ by supervised regression on the observed rewards:
\begin{align*}
\min_{\xi}\;\mathcal{L}_{r}(\xi)
:=
\mathbb{E}_{(s_t,a_t,r_t,\cdot)\sim\mathcal{D}}
\Big[
\big(r_{\xi}(s_t,a_t)-r_t\big)^2
\Big].
\end{align*}
After training, $r_{\xi}$ serves as a differentiable reward oracle that can be queried at arbitrary $(s,a)$ pairs produced by downstream planning or policy optimization, and can be combined with the learned diffusion transition model to form multi-step return objectives.
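
As an illustration of such a multi-step objective, one possible $H$-step return for an action sequence $a_{t:t+H-1}$ and fixed noise draws is sketched below. The horizon $H$, the discount factor $\gamma\in(0,1)$, the rollout states $\hat s$, and the exact form of the terminal term are assumptions made for this example, not a statement of the objective used elsewhere in this work.
\begin{align*}
\hat s_t &= s_t, \qquad
\hat s_{t+h+1} = f_{\theta}\big(\hat s_{t+h}, a_{t+h}, \varepsilon_{t+h}\big), \quad h=0,\dots,H-1, \\
\hat J\big(a_{t:t+H-1}\big)
&= \sum_{h=0}^{H-1} \gamma^{h}\, r_{\xi}\big(\hat s_{t+h}, a_{t+h}\big)
+ \gamma^{H}\, Q_{\phi}\big(\hat s_{t+H}, \pi_{\psi}(\hat s_{t+H})\big),
\end{align*}
where $\pi_{\psi}(\hat s_{t+H})$ denotes an action selected by the pretrained policy at the final rollout state. Because $f_{\theta}$, $r_{\xi}$, and $Q_{\phi}$ are differentiable, $\hat J$ can be differentiated with respect to the actions for fixed noise.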
\subsection{Policy and Terminal Value}

Using the same offline dataset $\mathcal{D}$, we learn a parametric policy $\pi_{\psi}(\cdot\mid s)$ and a parametric action-value function $Q_{\phi}(s,a)$. We adopt a Behavior-Regularized Actor-Critic (BRAC) style objective \cite{wu2019brac} and follow the practical architectural and training choices of ReBRAC \cite{tarasov2023minimalist}. The critic is trained by one-step temporal-difference regression with a target network $Q_{\bar{\phi}}$:
\begin{align*}
\mathcal{L}_{Q}(\phi)
:=
\mathbb{E}_{\substack{(s,a,r,s')\sim \mathcal{D}\\ a'\sim \pi_{\psi}(\cdot\mid s')}}
\Big[
\big(Q_{\phi}(s,a) - r - \gamma Q_{\bar{\phi}}(s',a')\big)^2
\Big].
\end{align*}
The actor is trained to select actions with high value under $Q_{\phi}$, while remaining close to the dataset behavior by a behavior-cloning regularizer that increases the likelihood of dataset actions under $\pi_{\psi}$:
\begin{align*}
\mathcal{L}_{\pi}(\psi)
:=
\mathbb{E}_{\substack{(s,a)\sim \mathcal{D}\\ a^{\pi}\sim \pi_{\psi}(\cdot\mid s)}}
\Big[
- Q_{\phi}(s,a^{\pi})
\;-\;
\alpha \log \pi_{\psi}(a\mid s)
\Big],
\end{align*}
where $\alpha>0$ (not to be confused with the diffusion schedule $\alpha_k$) controls the strength of behavior regularization and $\gamma\in(0,1)$ is the discount factor. In practice, $Q_{\bar{\phi}}$ is maintained as a slowly updated copy of $Q_{\phi}$ to stabilize the bootstrapped target in $\mathcal{L}_{Q}(\phi)$.
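
One common way to maintain such a slowly updated copy is exponential (Polyak) averaging of the critic parameters after each gradient step. The rate $\tau\in(0,1]$ is a hyperparameter introduced here for illustration, and the specific update used in this work may differ:
\begin{align*}
\bar{\phi} \leftarrow \tau\,\phi + (1-\tau)\,\bar{\phi} .
\end{align*}
Small values of $\tau$ make the target $Q_{\bar{\phi}}$ change slowly relative to $Q_{\phi}$, while $\tau=1$ recovers a hard copy of the current critic.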






