\section{Methodology}
\subsection{Problem Formulation: Simplex-Aligned Manifold}
We formulate the classification task as a conditional generative process. Let $\mathcal{D} = \{(\mathbf{x}, \mathbf{y})\}$ be the dataset, where $\mathbf{y} \in \Delta^{C-1}$ is the one-hot label on the probability simplex.

\textbf{Geometric Conflict in One-Hot Diffusion.} 
Standard diffusion models define the forward process on $\mathbf{y}$ as a Gaussian transition:
\begin{equation}
    \mathbf{y}_t = \sqrt{\bar{\alpha}_t}\mathbf{y} + \sqrt{1-\bar{\alpha}_t}\boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}).
\end{equation}
Since $\mathbf{y}$ lies on the bounded simplex (its entries are non-negative and sum to one) while $\boldsymbol{\epsilon}$ has unbounded support on $\mathbb{R}^C$, the noisy state $\mathbf{y}_t$ almost surely falls outside the valid simplex (i.e., $\mathbf{y}_t \notin \Delta^{C-1}$) for any $\bar{\alpha}_t < 1$, so the simplex geometry is no longer well defined during generation. In Appendix~\ref{sec:theoretical_analysis}, we prove that this mismatch introduces a systematic bias (Proposition~\ref{prop:bias}) and renders standard training objectives intractable (Proposition~\ref{prop:intractability}).
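
As a concrete illustration (the numbers are chosen purely for exposition and are not taken from our experiments), consider $C = 3$, $\mathbf{y} = (1, 0, 0)^\top$, $\bar{\alpha}_t = 0.64$, and a single noise draw $\boldsymbol{\epsilon} = (0.5, -1.0, 0.2)^\top$. The forward transition then gives
\[
    \mathbf{y}_t = 0.8\,(1, 0, 0)^\top + 0.6\,(0.5, -1.0, 0.2)^\top = (1.10, -0.60, 0.12)^\top, \qquad \sum_{k=1}^{3} y_t^{(k)} = 0.62 .
\]
This noisy state violates non-negativity, exceeds the upper bound of one, and does not sum to one, so it cannot be read as a probability vector.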

\noindent\textbf{Intuition for the Geometric Conflict.}
Intuitively, the support mismatch between unbounded Gaussian noise and the bounded probability simplex creates two fundamental hazards (proofs are given in Appendix~\ref{sec:theoretical_analysis}):
\begin{itemize}
    \item \textbf{Probability Leakage (Proposition~\ref{prop:bias}):} Since Gaussian noise is defined on all of $\mathbb{R}^C$, the diffusion process inevitably pushes noisy states into ``invalid'' regions (e.g., values $<0$ or $>1$). This causes a systematic boundary bias in which the model underestimates the values needed to reach the simplex vertices, leading to over-confident yet miscalibrated ``over-smoothed'' predictions.
    \item \textbf{Approximation Gap (Proposition~\ref{prop:intractability}):} Simply forcing states back onto the simplex (e.g., via Softmax) at each step breaks the linear superposition property of Gaussian diffusion. The training objective then becomes a biased proxy (a Jensen gap), preventing the model from converging to the true data manifold.
\end{itemize}
To resolve both hazards, we perform diffusion in the unconstrained logit space, where the clean state $\mathbf{z}_0$ is naturally compatible with Gaussian assumptions.

\textbf{Simplex-Aligned Logit Diffusion.}
Concretely, we map $\mathbf{y}$ to a centered logit state $\mathbf{z}_0$ via a scaled centered log-ratio (CLR) transformation, which is a diffeomorphism between the open simplex and the zero-sum hyperplane of $\mathbb{R}^C$:
\begin{equation}
    \mathbf{z}_0 = \mathcal{T}(\mathbf{y}) = \frac{1}{\lambda} \left( \log(\mathbf{y}_{\text{smooth}}) - \frac{1}{C}\sum_{k=1}^{C}\log(y_{\text{smooth}}^{(k)}) \right).
\end{equation}
Here, $\lambda$ is the scaling factor and $\mathbf{z}_0 \in \mathbb{R}^C$ resides in a Euclidean space compatible with Gaussian noise. The forward process $\mathbf{z}_t = \sqrt{\bar{\alpha}_t}\mathbf{z}_0 + \sqrt{1-\bar{\alpha}_t}\boldsymbol{\epsilon}$ is therefore well defined at every step of the diffusion chain. Because the logarithm diverges at the zero entries of a one-hot vector, we apply label smoothing to obtain $\mathbf{y}_{\text{smooth}}$, using a scalar smoothing coefficient $\epsilon = 0.001$ (distinct from the noise vector $\boldsymbol{\epsilon}$).
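
As a sanity check on the transformation, the following sketch works through $C = 3$. It assumes the common uniform form of label smoothing, $\mathbf{y}_{\text{smooth}} = (1-\epsilon)\mathbf{y} + \frac{\epsilon}{C}\mathbf{1}$, which is only one possible choice and is used here purely for illustration. For $\mathbf{y} = (1, 0, 0)^\top$ and $\epsilon = 0.001$,
\[
    \mathbf{y}_{\text{smooth}} \approx (0.99933,\; 0.00033,\; 0.00033)^\top, \qquad
    \log \mathbf{y}_{\text{smooth}} \approx (-0.0007,\; -8.0064,\; -8.0064)^\top .
\]
The mean of the log entries is approximately $-5.3378$, so that
\[
    \lambda\,\mathbf{z}_0 \approx (5.337,\; -2.669,\; -2.669)^\top .
\]
The entries sum to zero up to rounding. More generally, $\sum_{k=1}^{C} z_0^{(k)} = 0$ holds by construction, since subtracting the mean log removes the component along the all-ones direction. Without smoothing, the two zero entries of $\mathbf{y}$ would map to $-\infty$.
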
\subsection{Transformer-Enhanced Visual Tokenizer}
To provide context-aware guidance, we adopt the dual-stream feature extraction paradigm from DiffMIC-v2~\cite{yang2025diffmic} but significantly enhance the feature interaction mechanism using a Transformer encoder.

\textbf{Dual-Stream Extraction.} 
The {Global Stream} uses an encoder $E_g$ to extract a holistic token $\mathbf{f}_g$ and a spatial saliency map $\mathbf{S}$. Guided by $\mathbf{S}$, the {Local Stream} extracts $K$ discriminative patches and encodes them into regional tokens $\{\mathbf{f}_l^k\}_{k=1}^K$. 
We explicitly extract the global prior $\mathbf{v}_g$ and local prior $\mathbf{v}_l$ (logits) directly from these streams before interaction. These priors represent the initial predictions from global and local views, respectively.

\textbf{Cross-Granularity Interaction.}
Existing methods often combine global and local features via simple concatenation or static fusion. To explicitly model the dynamic dependency between the holistic context and local nuances, we construct a unified sequence $\mathbf{Z}^{(0)} = [\mathbf{f}_g, \mathbf{f}_l^1, \dots, \mathbf{f}_l^K]$. We feed this sequence into a Transformer Encoder Layer to perform deep interaction:
\begin{equation}
    \mathbf{Z}' = \text{LN}(\mathbf{Z}^{(0)} + \text{MSA}(\mathbf{Z}^{(0)})), \qquad
    \mathbf{Z}^{(1)} = \text{LN}(\mathbf{Z}' + \text{FFN}(\mathbf{Z}')).
\end{equation}
Here, $\text{MSA}$, $\text{FFN}$, and $\text{LN}$ denote multi-head self-attention, the position-wise feed-forward network, and layer normalization, respectively.
% The FFN consists of two linear transformations with a GELU activation: $\text{FFN}(\mathbf{x}) = \mathbf{W}_2(\sigma(\mathbf{W}_1\mathbf{x}))$. 
% Through this mechanism, the global token queries details from local tokens, while local tokens are calibrated by the global context.

\textbf{Output Formulation.}
The tokenizer yields two refined conditions for the diffusion process:
1) The first token of $\mathbf{Z}^{(1)}$ (the global position) is projected to obtain the fusion prior $\mathbf{v}_{\text{trans}}$, which is used for loss weighting.
2) The subsequent tokens form the \textbf{Refined Semantic Features} $\mathbf{F}_{\text{ref}} = \mathbf{Z}^{(1)}_{1:K+1}$ (zero-based indexing with an exclusive upper bound, i.e., the $K$ tokens at the local positions), which serve as the high-level semantic condition injected into the U-Net.
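
To make the tensor shapes explicit, the following bookkeeping is a sketch rather than a specification: the token dimension $d$ and the projection matrix $\mathbf{W}_p \in \mathbb{R}^{C \times d}$ are notation introduced here for illustration, a linear projection is assumed, and tokens are indexed from zero.
\[
    \mathbf{Z}^{(0)},\, \mathbf{Z}^{(1)} \in \mathbb{R}^{(K+1) \times d}, \qquad
    \mathbf{v}_{\text{trans}} = \mathbf{W}_p\, \mathbf{Z}^{(1)}_{0} \in \mathbb{R}^{C}, \qquad
    \mathbf{F}_{\text{ref}} \in \mathbb{R}^{K \times d} .
\]
Under these assumptions, the sequence length is preserved by the Transformer layer, the fusion prior has one logit per class, and the semantic condition has one refined token per local patch.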

\subsection{Generative Process with Refined Semantic Feature Injection}

We model classification as a reverse diffusion process refining $\mathbf{z}_t$ to $\mathbf{z}_0$. First, following~\cite{yang2025diffmic}, we concatenate a spatial guidance map $\mathcal{M}$ (derived from $\mathbf{v}_g, \mathbf{v}_l$) with $\mathbf{z}_t$. 
Second, we inject the Transformer-derived refined semantic features $\mathbf{F}_{\text{ref}}$ into the U-Net via an Adaptive Channel Gating mechanism. $\mathbf{F}_{\text{ref}}$ is projected and fused with the intermediate feature map $\mathbf{h}$ of the U-Net to compute a channel-wise gating weight $\mathbf{w}$. The feature map is then modulated as $\mathbf{h}' = \mathbf{h} \odot \text{Softmax}(\mathbf{w})$, where $\odot$ denotes channel-wise multiplication.
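
As a small worked example of the gating step (a sketch that assumes the Softmax is taken across the channel dimension, with three channels chosen only for illustration), let $\mathbf{w} = (0, 0, \log 2)^\top$. Then
\[
    \text{Softmax}(\mathbf{w}) = \tfrac{1}{4}\,(1, 1, 2)^\top = (0.25,\; 0.25,\; 0.5)^\top ,
\]
so the first two channels of $\mathbf{h}$ are scaled by $0.25$ and the third by $0.5$. Because the gates sum to one, raising the weight of one channel necessarily lowers the others, which makes the gate a relative weighting across channels rather than an independent per-channel switch.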

\textbf{Optimization.}
The diffusion model $\boldsymbol{\epsilon}_\theta$ predicts the noise $\boldsymbol{\epsilon}$. The training objective is a re-weighted MSE loss:
\begin{equation}
    \mathcal{L}_{\boldsymbol{\epsilon}} = \mathbb{E}_{t, \mathbf{z}_0, \boldsymbol{\epsilon}} \left[ \omega(\mathbf{v}_{\text{trans}}) \cdot \| \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\mathbf{z}_t, t, \mathcal{M}, \mathbf{F}_{\text{ref}}) \|^2 \right],
\end{equation}
where $\omega(\cdot)$ is a focal term derived from the Transformer's fusion prior $\mathbf{v}_{\text{trans}}$, enforcing focus on hard samples.
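
For reference, a noise prediction can be converted into an estimate of the clean logits by inverting the forward process. Assuming the standard variance-preserving parameterization $\mathbf{z}_t = \sqrt{\bar{\alpha}_t}\mathbf{z}_0 + \sqrt{1-\bar{\alpha}_t}\boldsymbol{\epsilon}$, solving for $\mathbf{z}_0$ and substituting the predicted noise gives
\[
    \hat{\mathbf{z}}_0 = \frac{\mathbf{z}_t - \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}_\theta(\mathbf{z}_t, t, \mathcal{M}, \mathbf{F}_{\text{ref}})}{\sqrt{\bar{\alpha}_t}}, \qquad
    \hat{\mathbf{z}}_0 - \mathbf{z}_0 = \sqrt{\frac{1-\bar{\alpha}_t}{\bar{\alpha}_t}}\,\bigl(\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta\bigr) .
\]
The second identity shows that, at a fixed timestep, the error in the recovered logits is proportional to the noise regression error penalized by the loss.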


\textbf{Reweighting of the Diffusion Loss.}
In $\mathcal{L}_{\boldsymbol{\epsilon}}$, the weighting function $\omega(\cdot)$ implements a focal-style sample reweighting mechanism applied directly to the diffusion noise regression objective.
Rather than introducing an auxiliary classification loss, this term modulates the contribution of each training sample within the diffusion denoising loss itself, based on the confidence that the fusion prior assigns to the ground-truth class.

Specifically, let $\mathbf{p} = \mathrm{Softmax}(\mathbf{v}_{\mathrm{trans}})$ denote the class probability vector obtained from the Transformer fusion prior $\mathbf{v}_{\mathrm{trans}}$, and let $p_y$ represent the predicted probability assigned to the true class $y$.
The weighting function is defined as
\begin{equation}
\omega(\mathbf{v}_{\mathrm{trans}}) = 1 + \alpha (1 - p_y)^{\gamma},
\end{equation}
where $\alpha$ and $\gamma$ are scalar hyperparameters controlling the overall strength and focusing behavior of the reweighting, respectively.

For $\alpha \ge 0$ and $\gamma > 0$, this formulation assigns larger weights to low-confidence (hard) samples, up to $1 + \alpha$ as $p_y \to 0$, while the weight of high-confidence (easy) samples decays toward the minimum of $1$ as $p_y \to 1$.
As a result, the diffusion model is encouraged to allocate more capacity to ambiguous or challenging cases without destabilizing training.
Importantly, this focal-style weighting operates solely on the diffusion noise prediction error and does not introduce an additional discriminative objective.
We therefore preserve the generative nature of the framework while improving optimization focus on difficult samples.
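
To give a sense of scale, the following values use $\alpha = 1$ and $\gamma = 2$, which are illustrative settings chosen here and not necessarily those used in our experiments:
\[
    p_y = 0.9:\ \omega = 1 + (0.1)^2 = 1.01, \qquad
    p_y = 0.5:\ \omega = 1 + (0.5)^2 = 1.25, \qquad
    p_y = 0.1:\ \omega = 1 + (0.9)^2 = 1.81 .
\]
Under these settings, a sample that the fusion prior nearly misclassifies contributes roughly $1.8$ times as much to the denoising loss as a confidently classified one, and no sample is weighted below $1$.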

\subsection{Inference Procedure}
During inference, we sample Gaussian noise $\mathbf{z}_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ and iteratively denoise it through the reverse diffusion process to recover the estimated logit vector $\hat{\mathbf{z}}_0$. This generation is conditioned on the structural guidance map $\mathcal{M}$ and the refined semantic features $\mathbf{F}_{\text{ref}}$, ensuring the generated logits are semantically consistent with the input image. Finally, the class probability vector is recovered as $\hat{\mathbf{p}} = \text{Softmax}(\lambda \cdot \hat{\mathbf{z}}_0)$, which inverts the scaled CLR transformation $\mathcal{T}$ and serves as the final classification prediction.
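
The final Softmax step can be checked against the scaled CLR transformation directly. The short derivation below assumes only that $\mathbf{y}_{\text{smooth}}$ has strictly positive entries that sum to one. Writing $m = \frac{1}{C}\sum_{j=1}^{C} \log y_{\text{smooth}}^{(j)}$, we have $\lambda\, \mathcal{T}(\mathbf{y})^{(k)} = \log y_{\text{smooth}}^{(k)} - m$, and therefore
\[
    \text{Softmax}\bigl(\lambda\, \mathcal{T}(\mathbf{y})\bigr)^{(k)}
    = \frac{y_{\text{smooth}}^{(k)}\, e^{-m}}{\sum_{j=1}^{C} y_{\text{smooth}}^{(j)}\, e^{-m}}
    = y_{\text{smooth}}^{(k)} .
\]
Moreover, since $\text{Softmax}(\lambda(\hat{\mathbf{z}}_0 + c\,\mathbf{1})) = \text{Softmax}(\lambda\,\hat{\mathbf{z}}_0)$ for any scalar $c$, the prediction is unaffected if the denoised logits are not exactly zero-sum.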

\begin{figure*}[t] 
  \floatconts
    {fig:method} 
    {
    % \vspace{-2mm}
    \caption{Overview of the proposed framework. 
    The method builds upon a dual-stream backbone enhanced by two key innovations: 
    (1) \textbf{Transformer-Enhanced Interaction}: Global and local features are fused via a Transformer layer to yield refined semantic features $\mathbf{F}_{\text{ref}}$ and a fusion prior. Explicit priors $\mathbf{v}_g, \mathbf{v}_l$ from the backbone are used to construct the spatial map.
    (2) \textbf{Simplex-Aligned Diffusion}: Unlike standard approaches operating on discrete one-hot vectors, our model operates on the continuous logit manifold $\mathbf{z}_0$, receiving structural guidance from $\mathcal{M}$ and semantic guidance from $\mathbf{F}_{\text{ref}}$ to iteratively denoise the target logits.}
    }
    {
    % \vspace{-3mm}
      \includegraphics[width=0.81\linewidth]{method.pdf}
      % \vspace{-5mm}
    }
    % \vspace{-7mm}
\end{figure*}