
\section{Experiments}
% \subsection{Setup}
% We evaluate our proposed method on two medical image classification benchmarks with diverse modalities: 
% (1) {APTOS2019}~\cite{aptos2019-blindness-detection} for diabetic retinopathy grading, consisting of 3,662 fundus images categorized into 5 severity levels.
% (2) {HAM10000}~\cite{tschandl2018ham10000} for skin lesion diagnosis, containing 10,015 dermatoscopic images across 7 diagnostic categories.
% For both datasets, we follow the official or standard splits for training and testing to ensure fair comparison. We provide more implementation details in Appendix~\ref {app:implementation}. We employ Accuracy (Acc), Macro F1-score (F1), and Cohen's Kappa ($\kappa$) ~\cite{cohen1960coefficient} as the primary metrics. F1-score is particularly emphasized to evaluate performance on imbalanced classes, while Kappa measures the inter-rater agreement for ordinal grading tasks. We provide definition of metrics in Appendix~\ref{app:math_definitions}.
We evaluate our method on two public benchmarks: \textbf{APTOS2019}~\cite{aptos2019-blindness-detection} (Retina) and \textbf{HAM10000}~\cite{tschandl2018ham10000} (Dermatoscopy). Following standard protocols~\cite{yang2025diffmic}, we employ Accuracy (Acc), Macro F1-score (F1), and Cohen's Kappa ($\kappa$)~\cite{cohen1960coefficient} as evaluation metrics. Detailed dataset statistics, data splitting strategies, and mathematical definitions of these metrics are provided in Appendix~\ref{app:implementation} and Appendix~\ref{app:math_definitions}.

\subsection{Main Results}
Table~\ref{tab:combined_results} presents the quantitative comparison against state-of-the-art methods. We benchmark against specialized imbalanced learning strategies, including LDAM~\cite{cao2019learning}, OHEM~\cite{shrivastava2016training}, MTL~\cite{liao2017deep}, DANIL~\cite{gong2020distractor}, CL~\cite{marrakchi2021fighting}, and ProCo~\cite{yang2022proco}, as well as advanced architectures such as DGCMM~\cite{wang2022deep}, UniFormer~\cite{li2023uniformer}, and the diffusion-based DiffMIC-v2~\cite{yang2025diffmic}.

On the \textbf{HAM10000} dataset, our method achieves the best result on both metrics. It exceeds the strongest competitor in Accuracy, UniFormer, by 0.5 percentage points (0.894 vs. 0.889), and improves on the diffusion-based DiffMIC-v2 by 1.1 points in Accuracy (0.894 vs. 0.883) and 0.3 points in F1-score (0.826 vs. 0.823). These gains suggest that our Simplex-Aligned Diffusion strategy and Cross-Granularity Interaction module help capture subtle diagnostic features in skin lesions.

On the \textbf{APTOS2019} dataset, our method attains the highest Accuracy (0.848), narrowly ahead of the Transformer-based UniFormer (0.847) and above DiffMIC-v2 (0.839). In F1-score, it is marginally below DiffMIC-v2 (0.666 vs. 0.669) and trails the best result, UniFormer (0.690). We attribute the small gap to DiffMIC-v2 to the well-known robustness-accuracy trade-off in deep learning: the baseline may exploit brittle, high-frequency shortcuts that maximize clean performance. By enforcing geometric consistency on the logit manifold, our Simplex-Aligned strategy acts as a regularizer that suppresses reliance on these unstable features, at a small cost on clean data but with substantial gains in noise resilience. As demonstrated in Section~\ref{robust}, DiffMIC-v2 maintains high clean accuracy but degrades sharply under noise, whereas our method preserves robust performance.

% On the \textbf{APTOS2019} dataset, our method secures the highest Accuracy ({0.848}), outperforming the Transformer-based UniFormer and DiffMIC-v2. While our F1-score is highly competitive and comparable to DiffMIC-v2 (0.666 vs. 0.669), we observe that discriminative methods like UniFormer show a slight advantage in this specific metric. However, it is crucial to note that marginal gains on clean benchmarks are often less critical in clinical deployment than reliability under uncertainty. The primary contribution of our Simplex-Aligned framework is not merely to push the ceiling of clean accuracy but to fundamentally resolve the geometric conflict in diffusion models, thereby ensuring stability under distribution shifts. As we will demonstrate in Section~\ref{robust}, while other methods maintain high clean accuracy, they exhibit catastrophic failure under noise, whereas our method preserves robust performance.

\begin{table}[htbp]
%\vspace{-3mm}
\centering
\floatconts
  {tab:combined_results}
  {\caption{Quantitative comparison with state-of-the-art methods on {HAM10000} and {APTOS2019} datasets. The best results are highlighted in \textbf{bold}, and the second best are \underline{underlined}.}}
  {%
    \vspace{-6mm}
    \setlength{\tabcolsep}{3pt} 
    \resizebox{\linewidth}{!}{
    \begin{tabular}{c|c|cccccccccc}
    \hline
    \multicolumn{2}{c|}{\bfseries Methods} & \bfseries LDAM & \bfseries OHEM & \bfseries MTL & \bfseries DANIL & \bfseries CL & \bfseries ProCo & \bfseries DGCMM & \bfseries UniFormer & \bfseries DiffMIC-v2 & \bfseries Ours \\
    \hline
    \multirow{2}{*}{\bfseries HAM10000} 
    & Accuracy & 0.857 & 0.818 & 0.811 & 0.825 & 0.865 & 0.887 & {0.886} & \underline{0.889} & 0.883 & \textbf{0.894} \\
    & F1-Score & 0.734 & 0.660 & 0.667 & 0.674 & 0.739 & 0.763 & {0.794} & {0.802} & \underline{0.823} & \textbf{0.826} \\
    \hline
    \multirow{2}{*}{\bfseries APTOS2019} 
    & Accuracy & 0.813 & 0.813 & 0.813 & 0.825 & 0.825 & 0.837 & {0.845} & \underline{0.847} & 0.839 & \textbf{0.848} \\
    & F1-Score & 0.620 & 0.631 & 0.632 & 0.660 & 0.652 & 0.674 & \underline{0.685} & \textbf{0.690} & 0.669 & 0.666 \\
    \hline
    \end{tabular}
    }
    %\vspace{-3mm}
  }
\end{table}


\subsection{Robustness Analysis}
\label{robust}
To comprehensively assess model reliability under distribution shifts, we adopt a two-fold evaluation strategy: a \textbf{Continuous Stress Test}, which uses Gaussian noise to measure degradation dynamics, and \textbf{Clinical Artifact Benchmarking}, which uses modality-specific corruptions to simulate acquisition failures in clinical settings.

In these experiments, we benchmark primarily against {DiffMIC-v2}, as it represents the current state-of-the-art in generative medical image classification. Our objective is to isolate the {geometric stability} of the generative label diffusion process itself. Therefore, DiffMIC-v2 serves as the most direct control to validate whether our Simplex-Aligned strategy effectively resolves the manifold mismatch problem inherent in prior generative classifiers.

\subsubsection{Continuous Stress Test: Degradation Dynamics}
\label{sec:stress_test}
We first conduct a continuous stress test by injecting additive Gaussian noise with intensity $\sigma \in [0.05, 0.30]$ to simulate stochastic sensor degradation. Figure~\ref{fig:robustness_combined} visualizes the degradation trends on APTOS2019 and HAM10000, respectively.
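
The corruption used in this stress test can be sketched as follows. This is a minimal formulation under stated assumptions, not a description of the exact implementation: pixel intensities are assumed to be normalized to $[0,1]$, the noise is assumed independent per pixel, and the result is assumed to be clipped back to the valid range. The symbols $x$, $\tilde{x}$, and $\epsilon$ are introduced here for illustration only.
\[
\tilde{x} = \mathrm{clip}\bigl(x + \epsilon,\ 0,\ 1\bigr), \qquad \epsilon \sim \mathcal{N}\bigl(0,\ \sigma^{2} I\bigr), \qquad \sigma \in [0.05, 0.30].
\]
Under these assumptions, $\sigma = 0.30$ corresponds to a noise standard deviation equal to 30\% of the full intensity range.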

\begin{figure}[t!] 
\floatconts
  {fig:robustness_combined}
  {\caption{{Continuous Stress Test Analysis.} Degradation trends under Gaussian noise ($\sigma \in [0.05, 0.30]$) on APTOS2019 (top) and HAM10000 (bottom). Our method (red) degrades more gracefully than the baseline (blue).}}
  {%
    \centering
    \includegraphics[width=0.8\linewidth]{aptos_robustness_curves.pdf} \\
    
    %\vspace{1mm} 
    
    \includegraphics[width=0.8\linewidth]{ham10000_robustness_curves.pdf}
    %\vspace{-8mm}
  }
  %\vspace{-7mm}
\end{figure}

%\vspace{0.5em}
\noindent \textbf{Resistance to Model Collapse.} 
As shown in Figure~\ref{fig:robustness_combined} (Top), on the APTOS dataset, the baseline exhibits a {catastrophic collapse} in agreement metrics. Specifically, as $\sigma$ increases to 0.3, its Cohen's Kappa drops precipitously from 0.73 to nearly zero (0.016), indicating that the model has degenerated to random guessing or majority-class prediction. In stark contrast, our Simplex-Aligned framework demonstrates {graceful degradation}, maintaining a clinically meaningful Kappa of {0.42} even under severe noise. This suggests that performing diffusion on the unbounded logit manifold prevents the network from being forced into incorrect simplex vertices when the input signal is ambiguous.
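
To see why a near-zero Kappa is the signature of majority-class prediction, consider the general weighted form of Cohen's Kappa. The notation below is introduced for illustration and may differ from the definition in Appendix~\ref{app:math_definitions}. Let $O_{ij}$ be the fraction of samples with true class $i$ and predicted class $j$, let $\pi_i = \sum_j O_{ij}$ and $q_j = \sum_i O_{ij}$ be the marginals, and let $w_{ij} \ge 0$ be a disagreement weight with $w_{ii} = 0$:
\[
\kappa = 1 - \frac{\sum_{i,j} w_{ij}\, O_{ij}}{\sum_{i,j} w_{ij}\, \pi_i\, q_j}.
\]
If the classifier always outputs a single class $c$, then $q_j = \delta_{jc}$ and $O_{ij} = \pi_i\, \delta_{jc} = \pi_i\, q_j$. The numerator and denominator then coincide, so $\kappa = 0$ whenever the denominator is nonzero, regardless of how large the accuracy $\pi_c$ is. The unweighted Kappa is the special case $w_{ij} = 1 - \delta_{ij}$.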

%\vspace{0.5em}
\noindent \textbf{Calibration Stability under Uncertainty.} 
A critical requirement for medical AI is reliability: knowing when the model is uncertain. Figure~\ref{fig:robustness_combined} (Bottom) reports the Expected Calibration Error (ECE) on the HAM10000 dataset. The baseline displays a volatile U-shaped error curve, with ECE spiking at high noise levels (reaching 0.17), which implies that the model remains over-confident even when its predictions are wrong. Conversely, our method keeps ECE below 0.07 across the entire noise spectrum. This stability suggests that our probabilistic formulation better reflects the uncertainty introduced by noise, so that the model's confidence tracks its actual predictive accuracy.
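
For reference, the standard binned estimator of ECE is sketched below, assuming this is the quantity being reported. The number of bins $M$ and the bin notation $B_m$ are assumptions made here for illustration. With $n$ test samples partitioned into $M$ bins by predicted confidence,
\[
\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n}\, \bigl|\, \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \,\bigr|,
\]
where $\mathrm{acc}(B_m)$ is the fraction of correct predictions in bin $B_m$ and $\mathrm{conf}(B_m)$ is the mean top-class confidence in that bin. An over-confident model has $\mathrm{conf}(B_m) > \mathrm{acc}(B_m)$ in its populated bins, which inflates ECE.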

\subsubsection{Clinical Artifact Benchmarking}

Unlike general vision benchmarks that evaluate robustness across a broad spectrum of synthetic distortions (e.g., snow, fog, pixelation), medical imaging requires a strictly {modality-specific} evaluation protocol. Artifacts must be physically plausible within the clinical acquisition pipeline to yield meaningful robustness insights. Therefore, guided by the corruption taxonomy established by the {MedMNIST-C} benchmark~\cite{salvomedmnist}, we selectively utilize standard {ImageNet-C}~\cite{hendrycks2019benchmarking} implementations to simulate only those corruptions relevant to fundus photography and dermatoscopy.

For the \textbf{APTOS2019} dataset (Retina), we select \textbf{Shot Noise} and \textbf{Motion Blur}. We specifically employ Shot Noise (Poisson noise) to rigorously simulate the photon counting statistics inherent in low-light fundus imaging sensors. Additionally, we assess \textbf{Motion Blur}, which models the frequent artifacts caused by patient \textit{eye saccades} or involuntary head movements during exposure, a pervasive challenge in non-mydriatic fundus photography~\cite{salvomedmnist}.
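
The shot-noise corruption can be sketched in the form commonly used by ImageNet-C style implementations. We assume intensities $x \in [0,1]$ and a photon-count scale $\lambda > 0$ tied to the severity level. The symbol $\lambda$ and the clipping step are assumptions for illustration, and no specific severity constants are implied:
\[
\tilde{x} = \mathrm{clip}\!\left(\frac{z}{\lambda},\ 0,\ 1\right), \qquad z \sim \mathrm{Poisson}(\lambda x).
\]
Before clipping, the corrupted pixel has mean $x$ and variance $x/\lambda$. The noise is therefore signal-dependent (brighter pixels fluctuate more), and a smaller $\lambda$, corresponding to fewer photons, yields a stronger corruption.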

For the \textbf{HAM10000} dataset (Dermatoscopy), we prioritize {motion blur} and {defocus blur}. As dermatoscopic images are typically acquired using handheld devices, they are uniquely susceptible to artifacts caused by operator hand tremors (motion) or improper focal depth~\cite{tschandl2018ham10000, salvomedmnist}. By benchmarking against these targeted distribution shifts, we assess the model's reliability under realistic clinical failure modes.
\begin{table}[htbp]
%\vspace{-3mm}
\floatconts
  {tab:robustness_combined_compact}
  {\caption{{Robustness Comparison on HAM10000 and APTOS2019.} ``Base'' denotes DiffMIC-v2 and ``Sev.'' the corruption severity level. All metrics are reported as decimals rather than percentages. Best results are \textbf{bolded}.}}
  {%
    % \vspace{-6mm}
    \setlength{\tabcolsep}{3pt} 
    \renewcommand{\arraystretch}{1.1} 
    \resizebox{0.75\linewidth}{!}{% width as a fraction of \linewidth
    \begin{tabular}{l|c|c|cc|cc|cc|cc}
    \hline
    \multirow{2}{*}{\bfseries Dataset} & \multirow{2}{*}{\bfseries Noise} & \multirow{2}{*}{\bfseries Sev.} & \multicolumn{2}{c|}{\bfseries Acc $\uparrow$} & \multicolumn{2}{c|}{\bfseries F1 $\uparrow$} & \multicolumn{2}{c|}{\bfseries Kappa $\uparrow$} & \multicolumn{2}{c}{\bfseries ECE $\downarrow$} \\
    \cline{4-11}
     & & & Base & Ours & Base & Ours & Base & Ours & Base & Ours \\
    \hline
    \multirow{6}{*}{HAM10000} 
    & \multirow{3}{*}{Defocus} 
      & 1 & 0.759 & \textbf{0.779} & 0.474 & \textbf{0.531} & 0.408 & \textbf{0.496} & 0.080 & \textbf{0.060} \\
    & & 3 & 0.685 & \textbf{0.703} & 0.198 & \textbf{0.292} & 0.116 & \textbf{0.293} & 0.143 & \textbf{0.100} \\
    & & 5 & 0.679 & \textbf{0.685} & 0.129 & \textbf{0.211} & 0.009 & \textbf{0.237} & 0.179 & \textbf{0.130} \\
    \cline{2-11}
    & \multirow{3}{*}{Motion} 
      & 1 & \textbf{0.842} & 0.826 & \textbf{0.700} & 0.657 & \textbf{0.641} & 0.636 & 0.099 & \textbf{0.073} \\
    & & 3 & 0.728 & \textbf{0.735} & 0.375 & \textbf{0.442} & 0.309 & \textbf{0.384} & 0.099 & \textbf{0.058} \\
    & & 5 & 0.692 & \textbf{0.696} & 0.235 & \textbf{0.281} & 0.187 & \textbf{0.279} & 0.118 & \textbf{0.115} \\
    \hline
    \hline
    \multirow{6}{*}{APTOS2019} 
    & \multirow{3}{*}{Shot} 
      & 1 & 0.140 & \textbf{0.507} & 0.113 & \textbf{0.247} & 0.112 & \textbf{0.209} & 0.405 & \textbf{0.256} \\
    & & 3 & \textbf{0.499} & 0.299 & 0.151 & \textbf{0.212} & -0.005 & \textbf{0.163} & \textbf{0.176} & 0.295 \\
    & & 5 & 0.324 & \textbf{0.608} & 0.176 & \textbf{0.281} & 0.062 & \textbf{0.460} & 0.224 & \textbf{0.108} \\
    \cline{2-11}
    & \multirow{3}{*}{Motion} 
      & 1 & 0.636 & \textbf{0.811} & 0.505 & \textbf{0.617} & 0.628 & \textbf{0.860} & 0.142 & \textbf{0.070} \\
    & & 3 & 0.567 & \textbf{0.634} & 0.332 & \textbf{0.425} & 0.607 & \textbf{0.623} & 0.211 & \textbf{0.128} \\
    & & 5 & 0.486 & \textbf{0.514} & 0.236 & \textbf{0.311} & \textbf{0.458} & 0.366 & 0.265 & \textbf{0.225} \\
    \hline
    \end{tabular}%
    }
  }
\end{table}
% Table~\ref{tab:robustness_combined_compact} present the evaluation results under clinically specific corruptions at discrete severity levels ($\sigma \in \{1, 3, 5\}$).

%\vspace{0.5em}
\noindent \textbf{Resilience to Acquisition Blur (HAM10000).}
As shown in Table~\ref{tab:robustness_combined_compact} (top), our method is more resilient to artifacts common in handheld dermatoscopy. Under \textbf{Defocus Blur}, the baseline's Kappa collapses to 0.009 at Level 5, whereas our Simplex-Aligned model retains a Kappa of 0.237, preserving some diagnostic utility under severe out-of-focus conditions. Similarly, under \textbf{Motion Blur}, our method outperforms the baseline in Accuracy, F1, and Kappa at Levels 3 and 5. Even where DiffMIC-v2 attains higher accuracy (Motion Blur Level 1, 0.842 vs. 0.826), our method yields a lower ECE (0.073 vs. 0.099). This suggests that our probabilistic scaling keeps the model better calibrated, avoiding the over-confident errors of the baseline.

%\vspace{0.5em}
\noindent \textbf{Analysis of Sensor Noise and Failure Modes (APTOS2019).}
The results on retinal images (Table~\ref{tab:robustness_combined_compact}, bottom) reveal failure modes in the baseline that accuracy alone can mask. 
\textbf{Under Shot Noise}, we observe a two-stage failure in the baseline. At Level 1, its accuracy drops to 14.0\%, below the 20\% expected from uniform random guessing over five classes. This implies hypersensitivity to noise artifacts; one plausible explanation is that the model mistakes high-frequency noise for pathological lesions. At Level 3, a striking anomaly occurs: the baseline reaches a deceptively high Accuracy of 49.9\% but a slightly \textit{negative} Kappa ($-0.005$). This pattern is consistent with \textbf{mode collapse}, where the network predicts almost exclusively the majority class and loses nearly all discriminative power. 
In contrast, our method keeps Kappa positive at all three severity levels (0.209, 0.163, and 0.460) and reaches 50.7\% accuracy at Level 1, although its Level 3 accuracy (29.9\%) and ECE (0.295) are worse than the baseline's. The positive Kappa suggests that it retains some structural discrimination rather than relying on class priors. 
\textbf{Under Motion Blur}, which simulates eye saccades, our method achieves its largest gain. At Level 1, it outperforms the baseline by 17.5 percentage points in Accuracy (0.811 vs. 0.636), and it remains ahead in Accuracy, F1, and ECE at Levels 3 and 5, although the baseline attains a higher Kappa at Level 5 (0.458 vs. 0.366). While the baseline struggles to resolve retinal features blurred by motion, our Simplex-Aligned diffusion recovers more of the structural semantics, demonstrating resilience to acquisition instability.

\subsection{Ablation Study}

As shown in Table~\ref{tab:ablation_study}, we investigate the contribution of each component by adding the Cross-Granularity Interaction module and the Simplex-Aligned Diffusion strategy to the baseline, first individually and then jointly.

%\vspace{0.5em}
\noindent \textbf{Impact of Simplex-Alignment (Row C vs. A):} Introducing Simplex-Alignment improves robustness on both datasets. On APTOS2019, accuracy under Gaussian noise ($\sigma=0.1$) rises from 0.534 to 0.662, and on HAM10000 from 0.706 to 0.730. This supports the view that constraining the diffusion process within a continuous logit simplex helps prevent model collapse when input features are corrupted.

%\vspace{0.5em}
\noindent \textbf{Impact of Interaction Module (Row B vs. A):} The Interaction module enhances feature extraction, improving clean accuracy on HAM10000 (0.883 to 0.891). However, relying solely on interaction (Row B) can lead to instability under noise: on HAM10000, noisy accuracy drops from 0.706 to 0.632, suggesting that refined features require structural regularization to remain robust.

%\vspace{0.5em}
\noindent \textbf{Synergy of the Framework (Row D):} Our full model (Row D) achieves the best result on every metric except clean F1 on APTOS2019, where it ranks second (0.666 vs. 0.669). By coupling the refined features from the Interaction module with the geometric constraints of Simplex-Alignment, it attains a favorable trade-off between clean performance and robustness. Notably, on the challenging APTOS2019 noise task, it improves accuracy by 18.2 percentage points over the baseline (0.534 to 0.716). On HAM10000, it reaches a noisy accuracy of 0.750, higher than either component alone (0.632 and 0.730), which indicates that the two components complement each other.
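
One way to quantify how the two components combine is a simple interaction contrast on the noisy-accuracy columns of Table~\ref{tab:ablation_study}. The quantity $\Delta_{\mathrm{int}}$ is introduced here for illustration and is not part of the original analysis. It compares the joint gain of Row D with the sum of the individual gains of Rows B and C over Row A:
\[
\Delta_{\mathrm{int}} = (D - A) - (B - A) - (C - A) = D - B - C + A.
\]
\[
\mbox{HAM10000: } 0.750 - 0.632 - 0.730 + 0.706 = 0.094, \qquad \mbox{APTOS2019: } 0.716 - 0.714 - 0.662 + 0.534 = -0.126.
\]
A positive value means the combination gains more than the sum of its parts (HAM10000). A negative value means the individual gains overlap (APTOS2019), where Row B alone already reaches 0.714. These are point estimates without reported variance, so small differences should be read with caution.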

\begin{table}[htbp]
%\vspace{-3mm}
\floatconts
  {tab:ablation_study}
  {\caption{{Ablation study evaluating the contributions of Cross-Granularity Interaction and Simplex-Aligned Diffusion to classification performance and noise robustness ($\sigma=0.1$).}}}
  {%
    %\vspace{-3mm}
    \setlength{\tabcolsep}{5pt} 
    \resizebox{1.0\linewidth}{!}{
    \begin{tabular}{l|cc|ccc|ccc}
    \hline
    \multirow{2}{*}{\textbf{Model}} & \multicolumn{2}{c|}{\textbf{Components}} & \multicolumn{3}{c|}{\textbf{HAM10000 (Derma)}} & \multicolumn{3}{c}{\textbf{APTOS2019 (Retina)}} \\
    \cline{2-9}
     & Simplex & Interaction & Clean Acc & Clean F1 & Noise Acc ($\sigma=0.1$) & Clean Acc & Clean F1 & Noise Acc ($\sigma=0.1$) \\
    \hline
    \hline
    % Row A: Baseline
    A (Baseline) & $\times$ & $\times$ & 0.883 & \underline{0.823} & 0.706 & \underline{0.839} & \textbf{0.669} & 0.534 \\
    \hline
    % Row B: Interaction Only
    B & $\times$ & \checkmark & 0.891 & 0.821 & 0.632 & 0.834 & 0.664 & \underline{0.714} \\
    \hline
    % Row C: Simplex Only
    C & \checkmark & $\times$ & \underline{0.893} & 0.811 & \underline{0.730} & 0.831 & 0.662 & 0.662 \\
    \hline
    % Row D: Ours
    \textbf{D (Ours)} & \checkmark & \checkmark & \textbf{0.894} & \textbf{0.826} & \textbf{0.750} & \textbf{0.848} & \underline{0.666} & \textbf{0.716} \\
    \hline
    \end{tabular}
    }
  }
\end{table}
\subsection{Qualitative Analysis and Visual Explanations}
\label{sec:qualitative_analysis}

To provide intuitive insights into the decision-making process of our proposed framework, we visualize the Grad-CAM~\cite{selvaraju2017grad} attention maps in Figure~\ref{fig:qualitative_analysis}. These visualizations compare the focus regions of the Baseline and our Simplex-Aligned method.
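
For reference, Grad-CAM derives a class-specific heatmap from the feature maps $A^{k}$ of a chosen convolutional layer and a class score $y^{c}$. The layer and the score used for the diffusion-based classifiers are not specified in this section, so both are left generic in the sketch below:
\[
\alpha_k^{c} = \frac{1}{Z} \sum_{i} \sum_{j} \frac{\partial y^{c}}{\partial A_{ij}^{k}}, \qquad L^{c} = \mathrm{ReLU}\!\left( \sum_{k} \alpha_k^{c}\, A^{k} \right),
\]
where $Z$ is the number of spatial positions in each feature map. The weights $\alpha_k^{c}$ measure how strongly channel $k$ influences class $c$, and the ReLU keeps only regions with a positive influence on that class.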

\textbf{Lesion Localization and Boundary Delineation (APTOS2019).} 
As shown in the top row of Figure~\ref{fig:qualitative_analysis}, the retinal images present challenging pathological features, such as distinct bright lesions known as \textbf{Hard Exudates} and \textbf{Cotton Wool Spots}. The baseline model tends to generate diffused attention maps, often confusing optical artifacts (e.g., light reflections or the optic disc) with actual lesions. In contrast, our Simplex-Aligned method demonstrates superior semantic selectivity. It accurately distinguishes pathological boundaries, focusing precisely on the clusters of Hard Exudates while suppressing background noise. This precise localization indicates that our method learns robust features on the logit manifold rather than overfitting to global image statistics.

\textbf{Robustness to Occlusion and Shape Consistency (HAM10000).}
The bottom row illustrates the model's performance on dermatoscopic images, which are frequently compromised by artifacts such as hair occlusion and ruler markings. 
For \textbf{Class 4 (Melanocytic nevi)}, where the lesion is heavily occluded by dense hair, the baseline's attention is disrupted, tracking the hair strands instead of the pigment network. Our method, however, exhibits remarkable robustness to such occlusions, successfully bypassing the hair artifacts to focus on the underlying lesion patterns. 
Similarly, for \textbf{Class 3 (Benign keratosis-like lesions)}, which typically present with irregular borders, our method captures the \textit{entire} extent of the lesion, delineating a larger and more accurate boundary compared to the baseline's center-biased focus. This confirms that our method can effectively learn the full morphological structure of the disease across different classes. We provide more visualization results covering additional classes and failure cases in Appendix~\ref{app:more_qualitative}.
\begin{figure*}[t] 

\centering
%\vspace{-4mm}
\floatconts
  {fig:qualitative_analysis}
  {\caption{{Qualitative Grad-CAM comparison between the baseline and our method, illustrating superior lesion localization in APTOS2019 (top) and resilience to clinical artifacts in HAM10000 (bottom).}
  }}
  {%
    \setlength{\tabcolsep}{2pt} 
    \renewcommand{\arraystretch}{0.5} 
    \begin{tabular}{cc}
      
      % --- Row 1: APTOS Example (Top) ---
      \includegraphics[width=0.48\linewidth]{APTOS_ID129_GT2_Base1.png} &
      \includegraphics[width=0.48\linewidth]{APTOS_ID431_GT2_Base1.png} \\
      
      % Spacer row: spans 2 columns
      \multicolumn{2}{c}{
      \vspace{1mm}} \\ 

      % --- Row 2: HAM10000 Example (Bottom) ---
      \includegraphics[width=0.48\linewidth]{ID1751_GT4_Base2_Simp4.png} &
      \includegraphics[width=0.48\linewidth]{ID1760_GT3_Base4_Simp3.png} \\
      %\vspace{-9mm}
      
    \end{tabular}
  }
  %\vspace{-3mm}
\end{figure*}

\subsection{Computational Efficiency and Practical Overhead}

Diffusion-based classifiers inherently involve iterative inference, which introduces additional computational overhead compared to standard discriminative CNNs. However, our Simplex-Aligned Diffusion significantly reduces this overhead relative to prior diffusion-based medical classifiers by operating in a low-dimensional logit space rather than the high-dimensional pixel space.

As shown in Table~\ref{tab:efficiency}, our method achieves an inference latency of 2.40 ms per image, slightly lower than DiffMIC-v2 (2.51 ms), while reducing the computational cost by approximately 39\% in GFLOPs (168.8G vs.\ 278.0G) and lowering peak GPU memory (1.95 GB vs.\ 2.37 GB).
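
The relative savings follow directly from the entries of Table~\ref{tab:efficiency}, as the worked arithmetic below shows. The throughput line additionally assumes that throughput is the reciprocal of the per-image latency, which would not hold under batching:
\[
\begin{array}{lcl}
\mbox{GFLOPs reduction} & = & (277.99 - 168.80)/277.99 \approx 39.3\% \\
\mbox{Latency reduction} & = & (2.51 - 2.40)/2.51 \approx 4.4\% \\
\mbox{Peak memory reduction} & = & (2.37 - 1.95)/2.37 \approx 17.7\% \\
\mbox{Throughput (ours)} & = & 1000/2.40 \approx 416.7 \mbox{ FPS} \\
\mbox{Throughput (DiffMIC-v2)} & = & 1000/2.51 \approx 398.4 \mbox{ FPS}
\end{array}
\]
The saving is therefore concentrated in compute and memory, with only a small change in wall-clock latency.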

Although standard CNNs remain faster in absolute terms, our goal is not to match their raw throughput, but to substantially reduce the overhead of diffusion-based classifiers while achieving improved robustness and calibration under acquisition shifts. In this context, the additional $\sim$2 ms of latency relative to ResNet-50 (2.40 ms vs.\ 0.36 ms) represents a favorable trade-off for safety-critical medical applications, where reliability and calibrated uncertainty are essential.

\begin{table}[ht]
\centering
\caption{Computational efficiency and hardware overhead comparison.}
\vspace{7pt}
\label{tab:efficiency}
\begin{tabular}{lccc}
\toprule
\textbf{Metric} & \textbf{ResNet-50} & \textbf{DiffMIC-v2} & \textbf{Ours} \\
\midrule
Inference Latency (ms/img) & 0.36 & 2.51 & 2.40 \\
Total GFLOPs & 2.05 & 277.99 & 168.80 \\
Peak GPU Memory (GB) & $\sim$0.90 & 2.37 & 1.95 \\
Throughput (FPS) & $\sim$2700 & $\sim$398 & $\sim$416 \\
\bottomrule
\end{tabular}
\end{table}
