\subsection{Offline Reinforcement Learning}
\label{sec:offline_rl}

We consider a discounted Markov decision process (MDP) specified by the tuple
$\mathcal{M}=\langle \mathcal{S},\mathcal{A},P,d_0,r,\gamma\rangle$,
where $\mathcal{S}$ and $\mathcal{A}$ are the state and action spaces, $P(s'\mid s,a)$ is the transition kernel, $d_0$ is the initial state distribution, $r(s,a)$ is the reward function, and $\gamma\in[0,1)$ is the discount factor.
A stochastic policy $\pi(a\mid s)$ induces a trajectory $\tau=(s_0,a_0,r_0,s_1,a_1,r_1,\ldots)$ with probability
$p^\pi(\tau)=d_0(s_0)\prod_{t\ge 0}\pi(a_t\mid s_t)P(s_{t+1}\mid s_t,a_t)$
and discounted return $R(\tau) \;=\; \sum_{t=0}^{\infty}\gamma^t r(s_t,a_t).$
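
As a quick sanity check on this definition (an illustrative example; the bound $r_{\max}$ is an assumption introduced only here), suppose the rewards are bounded, $|r(s,a)|\le r_{\max}$. The geometric series then gives
\begin{align*}
|R(\tau)|
\;\le\;
\sum_{t=0}^{\infty}\gamma^{t}\,r_{\max}
\;=\;
\frac{r_{\max}}{1-\gamma},
\end{align*}
so the return is well defined for every $\gamma\in[0,1)$. For instance, a constant reward $r_t=1$ with $\gamma=0.99$ yields $R(\tau)=1/(1-0.99)=100$.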

The RL objective is to find a return-maximizing policy
\begin{align*}
\pi^{\star} \in \underset{\pi}{\argmax} ~~\mathbb{E}_{\tau\sim p^\pi}\!\left[R(\tau)\right]~.
\end{align*}

\paragraph{Temporal-difference learning.}
TD methods approximate the optimal action-value function
$Q^{\star}(s,a)=\mathbb{E}_{\tau\sim p^{\pi^{\star}}}[R(\tau)\mid s_0=s,a_0=a]$
with a parameterized critic $Q_\theta$ by minimizing a Bellman error on transition data:
\begin{align*}
\mathcal{L}_{\mathrm{TD}}(\theta)
=\!\!\!\!\!\!
\mathop{\mathbb{E}}\limits_{\substack{(s,a,r,s')\sim \mathcal{D}}}
\Big[
\big(r + \gamma \max_{a'\in\mathcal{A}} Q_\theta(s',a') - Q_\theta(s,a)\big)^2
\Big]~.
\end{align*}
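For reference, this loss is motivated by the Bellman optimality equation, of which $Q^{\star}$ is the unique fixed point when rewards are bounded and $\gamma<1$:
\begin{align*}
Q^{\star}(s,a)
\;=\;
r(s,a)
+\gamma\,
\mathbb{E}_{s'\sim P(\cdot\mid s,a)}
\Big[\max_{a'\in\mathcal{A}} Q^{\star}(s',a')\Big]~.
\end{align*}
The TD loss replaces the expectation over $s'$ by sampled transitions from $\mathcal{D}$. A common implementation choice, assumed here for illustration and not prescribed by the loss above, is to evaluate the bootstrap target with a slowly updated parameter copy $\bar{\theta}$ that is held fixed during differentiation, i.e., to regress $Q_\theta(s,a)$ onto $y = r+\gamma\max_{a'\in\mathcal{A}}Q_{\bar{\theta}}(s',a')$.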

For continuous action spaces, the maximization over $a'$ is typically implemented via a parameterized actor
$\pi_\phi(a\mid s)$, which is trained to maximize a policy objective of the form
\begin{align*}
\mathcal{J}(\phi)
\;:=\;
\mathbb{E}_{s\sim \mathcal{D},\,a\sim \pi_\phi(\cdot\mid s)}\!\left[Q_\theta(s,a)\right]~.
\end{align*}
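One way to see how the actor enters the critic update, given as a sketch under the assumption of a standard actor-critic scheme rather than as the specific procedure used later, is to replace the intractable maximization in $\mathcal{L}_{\mathrm{TD}}$ by an expectation under the current actor:
\begin{align*}
\max_{a'\in\mathcal{A}} Q_\theta(s',a')
\;\approx\;
\mathbb{E}_{a'\sim \pi_\phi(\cdot\mid s')}\!\left[Q_\theta(s',a')\right]~.
\end{align*}
Training then alternates between decreasing $\mathcal{L}_{\mathrm{TD}}(\theta)$ with this surrogate target and increasing $\mathcal{J}(\phi)$ with the critic held fixed.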

\paragraph{Offline RL and distribution shift.}
Offline RL learns a policy from a fixed dataset without additional environment interactions.
We assume access to a static dataset of transitions
$\mathcal{D}=\{(s_i,a_i,r_i,s_i')\}_{i=1}^{N}$ collected by an unknown behavior policy $\mu$.
A core difficulty is that naively applying TD learning and policy improvement can drive the learned policy
$\pi_\phi$ toward state-action regions that are weakly represented in $\mathcal{D}$, i.e., the induced
occupancy measure $d^{\pi_\phi}$ moves away from $d^{\mu}$.
Many offline RL methods address this distribution shift by enforcing an explicit constraint,
for example of the form
$D\!\left(d^{\pi_\phi}\,\middle\|\,d^{\mu}\right)\ \le\ \varepsilon,$
where $D$ is a divergence or discrepancy measure, incorporated directly into the learning procedure.
Such constrained formulations often introduce additional algorithmic heuristics to obtain stable and
competitive performance in practice.
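
As an illustration of how such a constraint can enter the learning procedure (a generic sketch with a hypothetical multiplier $\lambda\ge 0$, not a description of any particular method), the constrained problem may be relaxed into a penalized objective:
\begin{align*}
\max_{\phi}\;
\mathcal{J}(\phi)
\;-\;
\lambda\,
\Big(
D\!\left(d^{\pi_\phi}\,\middle\|\,d^{\mu}\right)-\varepsilon
\Big)~.
\end{align*}
Because the occupancy measures $d^{\pi_\phi}$ and $d^{\mu}$ are rarely available in closed form, practical variants often substitute a per-state discrepancy between $\pi_\phi(\cdot\mid s)$ and $\mu(\cdot\mid s)$ evaluated on states from $\mathcal{D}$.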


\subsection{Diffusion Probabilistic Models}
\label{sec:prelim_diffusion}


Diffusion probabilistic models \cite{sohl2015deep,ho2020denoising} define a latent-variable generative model by reversing a fixed forward diffusion (noising) process. Let $x^{0}\in\mathbb{R}^{d}$ denote a data sample. The model likelihood is defined by introducing latent variables $x^{1},\ldots,x^{K}$ and marginalizing them out:
\begin{align*}
p_{\theta}(x^{0})
\;:=\;
\int
p(x^{K})
\prod_{k=1}^{K} p_{\theta}(x^{k-1}\mid x^{k})\,
dx^{1:K},
\end{align*}
where $p(x^{K})$ is typically a standard Gaussian prior. The forward diffusion chain gradually corrupts the data by adding Gaussian noise according to a variance schedule $\{\beta_{k}\}_{k=1}^{K}$, i.e.,
$q(x^{1:K}\mid x^{0})
\;:=\;
\prod_{k=1}^{K} q(x^{k}\mid x^{k-1}),
$
with one-step transitions
$q(x^{k}\mid x^{k-1})
\;:=\;
\mathcal{N}\!\big(x^{k};
\sqrt{1-\beta_{k}}\,x^{k-1},\,
\beta_{k} I\big).$
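
A useful consequence of these Gaussian transitions, stated here for reference following \citet{ho2020denoising} and using a shorthand $\bar{\alpha}_{k}$ introduced only for this remark, is that $x^{k}$ can be sampled directly from $x^{0}$:
\begin{align*}
q(x^{k}\mid x^{0})
\;=\;
\mathcal{N}\!\big(x^{k};
\sqrt{\bar{\alpha}_{k}}\,x^{0},\,
(1-\bar{\alpha}_{k}) I\big),
\qquad
\bar{\alpha}_{k} \;:=\; \prod_{i=1}^{k}(1-\beta_{i}),
\end{align*}
or equivalently $x^{k}=\sqrt{\bar{\alpha}_{k}}\,x^{0}+\sqrt{1-\bar{\alpha}_{k}}\,\epsilon$ with $\epsilon\sim\mathcal{N}(0,I)$. The case $k=2$ can be checked by hand: composing two steps gives mean $\sqrt{(1-\beta_{1})(1-\beta_{2})}\,x^{0}$ and noise variance $(1-\beta_{2})\beta_{1}+\beta_{2}=1-(1-\beta_{1})(1-\beta_{2})$, which matches $1-\bar{\alpha}_{2}$.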


The reverse (denoising) process is parameterized as a Gaussian with timestep-dependent covariance 
$p_{\theta}(x^{k-1}\mid x^{k})
\;=\;
\mathcal{N}\!\big(x^{k-1};
\mu_{\theta}(x^{k},k),\,
\Sigma_{k}\big)$, where $\mu_{\theta}(x^{k},k)$ is a neural network that maps the noisy latent $x^{k}$ and timestep $k$ to the mean of the reverse transition.
Although a tractable variational lower bound on $\log p_{\theta}(x^{0})$ can be optimized to train diffusion models, in practice it is common to use the simplified surrogate objective proposed by \citet{ho2020denoising}, which trains the model to predict the injected noise.
In what follows, the forward noising process is written in terms of $\alpha_{k-1} := 1-\beta_{k}$ as
\begin{align*}
q(x^{k}\mid x^{k-1})
\;:=\;
\mathcal{N}\!\Big(
x^{k};
\sqrt{\alpha_{k-1}}\,x^{k-1},\,
(1-\alpha_{k-1})I
\Big),
\end{align*}
where $\{\alpha_{k}\}_{k=0}^{K-1}$ specifies the variance schedule, $x^{0} := x$ is a data sample, and $x^{K}$ is approximately distributed as $\mathcal{N}(0,I)$ for $K$ large enough.
The learned reverse process $p_{\theta}(x^{k-1}\mid x^{k})$ is the Gaussian with mean $\mu_{\theta}(x^{k},k)$ and covariance $\Sigma_{k}$ introduced above.

The simplified denoising loss is
\begin{align*}
\mathcal{L}_{\mathrm{denoise}}(\theta)
\;:=\;
\mathbb{E}_
{\substack{
k\sim \mathrm{Unif}(\{1,\ldots,K\})\\
x^{0}\sim q, \;\;
\epsilon\sim \mathcal{N}(0,I)
}}
\Big[
\big\|
\epsilon - \epsilon_{\theta}(x^{k},k)
\big\|_2^2
\Big],
\end{align*}
where $\epsilon_{\theta}(x^{k},k)$ is a neural network that predicts the Gaussian noise $\epsilon$ added to $x^{0}$ to obtain the noisy latent $x^{k}$.
This is equivalent to predicting the reverse-process mean, since $\mu_{\theta}(x^{k},k)$ can be expressed as a function of $\epsilon_{\theta}(x^{k},k)$ \cite{ho2020denoising}.
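For concreteness, the relation takes the following form in \citet{ho2020denoising}, written here with the cumulative coefficient $\bar{\alpha}_{k}:=\prod_{i=1}^{k}(1-\beta_{i})$ (a shorthand used only in this remark):
\begin{align*}
\mu_{\theta}(x^{k},k)
\;=\;
\frac{1}{\sqrt{1-\beta_{k}}}
\left(
x^{k}
-\frac{\beta_{k}}{\sqrt{1-\bar{\alpha}_{k}}}\,
\epsilon_{\theta}(x^{k},k)
\right).
\end{align*}
A sample is then generated by drawing $x^{K}\sim\mathcal{N}(0,I)$ and iterating $x^{k-1}=\mu_{\theta}(x^{k},k)+\Sigma_{k}^{1/2}z$ with $z\sim\mathcal{N}(0,I)$ for $k=K,\ldots,1$, where the noise term is commonly omitted at the final step $k=1$.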


