

\label{sec:theoretical_analysis}
In this section, we analyze the theoretical limitations of applying standard Gaussian diffusion directly to simplex-constrained data (e.g., One-Hot labels). We verify two hypotheses corresponding to common baselines: first, that unconstrained Gaussian diffusion leads to a systematic boundary bias due to support mismatch; second, that strictly enforcing constraints step-wise renders the training objective intractable.

\subsection{Bias from Probability Leakage and Rectification}

\begin{proposition}
\label{prop:bias}
Standard DDPM models the reverse-process posterior as an unconstrained Gaussian supported on all of $\mathbb{R}^C$, whereas valid One-Hot data lies at the vertices of the probability simplex $\Delta^{C-1}$. The Gaussian assumption therefore assigns probability mass to invalid regions (\textit{Probability Leakage}), which introduces a systematic bias when the unconstrained mean is used to estimate valid data.
\end{proposition}

\begin{proof}
\textbf{Step 1: The Unbounded Gaussian Posterior.}
By construction of the forward diffusion process, the true posterior $q(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0)$ is a Gaussian over Euclidean space with mean $\tilde{\boldsymbol{\mu}}_t$ and isotropic variance $\tilde{\beta}_t$:
\begin{equation}
    q(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_{t-1}; \tilde{\boldsymbol{\mu}}_t, \tilde{\beta}_t \mathbf{I})
\end{equation}
The standard MSE objective trains the model to approximate this unbounded mean $\tilde{\boldsymbol{\mu}}_t$.
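
For reference, in the usual DDPM parameterization the posterior parameters have a well-known closed form. The shorthand $\alpha_t = 1-\beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t}\alpha_s$ is assumed here, since it is not introduced above:
\begin{equation}
    \tilde{\boldsymbol{\mu}}_t = \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\,\mathbf{x}_0 + \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\,\mathbf{x}_t,
    \qquad
    \tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\,\beta_t .
\end{equation}
Both coefficients in $\tilde{\boldsymbol{\mu}}_t$ are non-negative, but $\mathbf{x}_t$ is an unconstrained Gaussian sample, so individual components of $\tilde{\boldsymbol{\mu}}_t$ can be negative even when $\mathbf{x}_0$ is One-Hot.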

\textbf{Step 2: Probability Leakage.}
Since the target $\mathbf{x}_0$ lies on the simplex (One-Hot), every coordinate of a valid signal must be non-negative ($\mathbf{x} \ge 0$). Consider a single coordinate $x$ of $\mathbf{x}_{t-1}$, with posterior mean $\tilde{\mu}$ and variance $\tilde{\beta}$. The Gaussian posterior spreads probability mass into the invalid negative half-line. We define this \textit{Probability Leakage} as:
\begin{equation}
    P_{\text{leak}} = \int_{-\infty}^{0} \mathcal{N}(x; \tilde{\mu}, \tilde{\beta}) \, dx > 0
\end{equation}
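Because $P_{\text{leak}} > 0$ for every finite $\tilde{\mu}$ and every $\tilde{\beta} > 0$, the unconstrained mean $\tilde{\boldsymbol{\mu}}_t$ does not coincide with the mean of the posterior restricted to the valid domain, and is therefore a biased estimator for the constrained data.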
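
As a worked illustration (per coordinate, with $\sigma = \sqrt{\tilde{\beta}}$ as shorthand introduced here and $\Phi$ the standard normal CDF), the leakage integral evaluates to
\begin{equation}
    P_{\text{leak}} = \Phi\!\left(\frac{-\tilde{\mu}}{\sigma}\right).
\end{equation}
For instance, $\tilde{\mu} = 0$ gives $P_{\text{leak}} = 1/2$, and $\tilde{\mu} = \sigma$ gives $P_{\text{leak}} = \Phi(-1) \approx 0.159$. The leakage vanishes only in the limit $\tilde{\mu}/\sigma \to \infty$.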

\textbf{Step 3: Derivation of Systematic Bias.}
To recover a valid estimate, we must take the expectation over the valid domain ($x \ge 0$) only. This is the first moment of a Gaussian truncated to $x \ge 0$, in which the remaining probability mass ($1-P_{\text{leak}}$) is renormalized to one:
\begin{equation}
    \mathbb{E}_{\text{valid}}[x] = \frac{1}{1 - P_{\text{leak}}} \int_{0}^{\infty} x \cdot \mathcal{N}(x; \tilde{\mu}, \tilde{\beta}) \, dx
\end{equation}
Solving this integral reveals a shift from the original mean:
\begin{equation}
    \mathbb{E}_{\text{valid}}[x] = \tilde{\mu} + \underbrace{\sqrt{\tilde{\beta}} \cdot \lambda\left(\frac{-\tilde{\mu}}{\sqrt{\tilde{\beta}}}\right)}_{\text{Bias } \delta}
\end{equation}
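where $\lambda(\cdot) = \frac{\phi(\cdot)}{1-\Phi(\cdot)}$ is the Inverse Mills Ratio, and $\phi$ and $\Phi$ denote the standard normal density and CDF. Evaluated at $-\tilde{\mu}/\sqrt{\tilde{\beta}}$, the denominator $1-\Phi(\cdot)$ is exactly the valid probability mass ($1-P_{\text{leak}}$).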
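
The closed form can be checked with a short derivation. This is a sketch in which $\sigma = \sqrt{\tilde{\beta}}$ is shorthand introduced here, and $\phi$, $\Phi$ are the standard normal density and CDF. Substituting $z = (x - \tilde{\mu})/\sigma$ and using $\int_{a}^{\infty} z\,\phi(z)\,dz = \phi(a)$:
\begin{equation}
    \int_{0}^{\infty} x \, \mathcal{N}(x; \tilde{\mu}, \sigma^2) \, dx
    = \int_{-\tilde{\mu}/\sigma}^{\infty} (\tilde{\mu} + \sigma z)\,\phi(z)\,dz
    = \tilde{\mu}\left[1 - \Phi\!\left(\tfrac{-\tilde{\mu}}{\sigma}\right)\right] + \sigma\,\phi\!\left(\tfrac{-\tilde{\mu}}{\sigma}\right).
\end{equation}
Dividing by $1 - P_{\text{leak}} = 1 - \Phi(-\tilde{\mu}/\sigma)$ yields $\tilde{\mu} + \sigma\,\lambda(-\tilde{\mu}/\sigma)$. As numerical examples, $\tilde{\mu} = 0$ gives $\delta = \sigma\sqrt{2/\pi} \approx 0.80\,\sigma$, while $\tilde{\mu} = \sigma$ gives $\delta = \sigma\,\phi(1)/\Phi(1) \approx 0.29\,\sigma$. The bias is therefore largest for coordinates whose mean lies close to the boundary and shrinks as $\tilde{\mu}/\sigma$ grows.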

\textbf{Step 4: Conclusion.}
The standard DDPM objective regresses onto $\tilde{\boldsymbol{\mu}}_t$ and ignores the bias term $\boldsymbol{\delta}$ (the vector of per-coordinate biases $\delta$) required to compensate for probability leakage. Since $\boldsymbol{\delta} > 0$ componentwise, the model systematically underestimates the values needed to stay on the manifold, causing the generated samples to drift away from the simplex vertices towards the interior (over-smoothing).
\end{proof}


\subsection{Intractability of Projected Diffusion}

\begin{proposition}
\label{prop:intractability}
An alternative baseline strategy is to strictly enforce simplex constraints via a normalization function $f(\cdot)$ (e.g., Softmax) at each forward diffusion step. We show that this approach breaks the Gaussian marginal property, causing the standard training objective to rely on a biased approximation.
\end{proposition}

\begin{proof}
\textbf{Step 1: The Projected Forward Process.}
Consider a process where a non-linear projection $f: \mathbb{R}^C \to \Delta^{C-1}$ is applied immediately after noise injection at every transition step:
\begin{equation}
    \mathbf{x}_t = f(\sqrt{1-\beta_t}\mathbf{x}_{t-1} + \sqrt{\beta_t}\boldsymbol{\epsilon}_t)
\end{equation}

\textbf{Step 2: Loss of Closed-Form Marginals.}
The efficiency of standard DDPM training relies on the closure of Gaussians under affine maps: a composition of linear-Gaussian transitions is again Gaussian, so $\mathbf{x}_t$ can be sampled directly from $\mathbf{x}_0$ in $O(1)$ time without simulating the intermediate steps. In the projected process, however, the state $\mathbf{x}_t$ depends on $\mathbf{x}_{t-1}$, which is itself a non-linear function of $\mathbf{x}_{t-2}$. Unrolling this recursion yields a nested composition of non-linearities:
\begin{equation}
    \mathbf{x}_t = f\Big( \sqrt{1-\beta_t} f\big( \dots \big) + \sqrt{\beta_t}\boldsymbol{\epsilon}_t \Big)
\end{equation}
Unlike the linear case, these nested functions do not collapse into a single Gaussian distribution. Consequently, the marginal $q(\mathbf{x}_t|\mathbf{x}_0)$ has no analytical closed form, and the posterior $q(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0)$ that serves as the training target becomes intractable as well.
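
For contrast, the following sketch shows why the unprojected chain does collapse. It uses the standard DDPM shorthand $\alpha_t = 1-\beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t}\alpha_s$, which is assumed here rather than defined above. Unrolling two linear steps gives
\begin{equation}
    \mathbf{x}_t = \sqrt{\alpha_t \alpha_{t-1}}\,\mathbf{x}_{t-2} + \sqrt{\alpha_t (1-\alpha_{t-1})}\,\boldsymbol{\epsilon}_{t-1} + \sqrt{1-\alpha_t}\,\boldsymbol{\epsilon}_t .
\end{equation}
The two independent noise terms merge into a single Gaussian with variance $\alpha_t(1-\alpha_{t-1}) + (1-\alpha_t) = 1 - \alpha_t\alpha_{t-1}$, and iterating yields $\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}$ with $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$. Once $f$ is inserted between the steps, the argument of the outer $f$ is no longer affine in $\mathbf{x}_{t-2}$ and $\boldsymbol{\epsilon}_{t-1}$, so this merging step is unavailable.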


\textbf{Step 3: Systematic Bias from Approximation Gap.}
To bypass this intractability, baselines typically approximate the objective by targeting the \textit{projection of the expectation}. However, the true optimal denoising target is the \textit{expectation of the projected states}.
Because $f$ is non-linear, the expectation does not commute with the projection mapping: in general, $\mathbb{E}[f(\mathbf{z})] \neq f(\mathbb{E}[\mathbf{z}])$. This introduces a systematic discrepancy $\mathcal{J}_t$:
\begin{equation}
    \mathcal{J}_t = \left\| \underbrace{\mathbb{E}_{\boldsymbol{\epsilon}} [ f(\mathbf{u}_t + \boldsymbol{\epsilon}) ]}_{\text{True Denoising Target}} - \underbrace{f( \mathbb{E}_{\boldsymbol{\epsilon}} [\mathbf{u}_t + \boldsymbol{\epsilon}] )}_{\text{Biased Proxy Target}} \right\|,
\end{equation}
where $\mathbf{u}_t$ denotes the pre-projection state. The quantity $\mathcal{J}_t$ is non-zero in general, although it can vanish at special symmetric configurations (e.g., for Softmax with isotropic noise when all components of $\mathbf{u}_t$ are equal). We define this residue $\mathcal{J}_t$ as the \textbf{Approximation Gap} (analogous to the Jensen gap). Standard diffusion training explicitly optimizes towards the \textit{Biased Proxy Target}. Wherever $\mathcal{J}_t$ is non-zero, the model optimizes a \textit{mis-specified objective}, in which the learned mean is structurally shifted away from the true data manifold. This error can accumulate over timesteps, hindering convergence to the valid simplex distribution.
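
To make the size of the gap concrete, consider a leading-order sketch under assumptions that are not part of the statement above: $f$ is the Softmax, $C = 2$, and $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \sigma_t^2 \mathbf{I})$ with a hypothetical noise scale $\sigma_t$. The first Softmax output then reduces to the logistic function $s(d) = 1/(1+e^{-d})$ of the logit difference $d = u_{t,1} - u_{t,2}$, perturbed by noise $\eta \sim \mathcal{N}(0, 2\sigma_t^2)$. A second-order Taylor expansion gives
\begin{equation}
    \mathbb{E}_{\eta}\big[ s(d + \eta) \big] - s(d) \;\approx\; \sigma_t^2 \, s''(d) = \sigma_t^2 \, s(d)\,\big(1 - s(d)\big)\,\big(1 - 2 s(d)\big).
\end{equation}
For $d > 0$ we have $s(d) > 1/2$ and the correction is negative, so the true target lies closer to the simplex interior than the proxy $s(d)$. To this order the correction grows linearly in the noise variance $\sigma_t^2$ and vanishes at the symmetric point $d = 0$.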
\end{proof}
