

\section{Experiments} \label{sec:experiments}

%\subsection{Setup}

\begin{wraptable}{r}{2.2in}
\vspace{-.3in}
\centering
\caption{\footnotesize Summary of datasets.}\label{tab:datasets}
\resizebox{\linewidth}{!}{%
\begin{tabular}{lcc}
\toprule
\textbf{Dataset} & \textbf{\# Images} & \textbf{\# Classes} \\
\midrule
DermaMNIST  & 10{,}015 & 7 \\
MILK-10K    & 5{,}240  & 11 \\
PAD-UFES-20 & 2{,}298  & 6 \\
\bottomrule
\end{tabular}}
\vspace{-.15in}
\end{wraptable}
\paragraph{Datasets.} We evaluate our framework on three publicly available dermoscopic image datasets:
DermaMNIST, MILK-10K, and PAD-UFES-20. Together they cover a range of lesion types,
acquisition devices, and class distributions. All images are expert-annotated dermoscopic RGB scans, resized to a fixed resolution before training.
DermaMNIST~\cite{yang2021medmnist}, derived from the HAM10000 dataset~\cite{tschandl2018ham10000} used in the ISIC 2018 challenge~\cite{codella2019isic}, contains 10{,}015 dermoscopic images across seven diagnostic categories and is a standard skin lesion classification benchmark.
{MILK-10K}~\cite{philipp2025milk10k} is a collection of approximately 10{,}000 dermoscopic and clinical images across eleven classes.
To keep the setting consistent across datasets, we use only the dermoscopic subset (5{,}240 images) in our experiments.
{PAD-UFES-20}~\cite{pacheco2020pad} comprises 2{,}298 smartphone-acquired dermoscopic images labeled by pathologists into six diagnostic categories, introducing realistic variability in acquisition conditions and illumination.

\noindent \textbf{Data splits and cross-dataset evaluation.} \quad For each dataset we perform an 80:10:10 split into training, validation, and test sets (grouped by patient identifier where available to avoid leakage). We first train and evaluate models on each dataset independently. To assess \textbf{cross-dataset generalization}, we also conduct a transfer experiment: models are trained on the five shared lesion classes from DermaMNIST and the dermoscopic subset of MILK-10K, and evaluated on the corresponding subset of PAD-UFES-20.

\smallskip

\noindent \textbf{Preprocessing and Topological Features.} \quad
%\label{sec:preprocessing}
All dermoscopic images are resized to $224\times224$ pixels to ensure uniform spatial resolution across datasets. Standard normalization is applied to the RGB channels before feature extraction.

\noindent \textit{Single Persistence Topological Descriptors.}  \quad
For the single-parameter persistent homology setting, we compute Betti features using 50 filtration thresholds (\textit{n\_bins} = 50). For each image and each channel, this produces three 50-dimensional vectors: Betti-0, Betti-1, and the number of activated pixels, computed independently for the red, green, blue, and grayscale channels. This results in a total of $4 \text{ (channels)} \times 3 \text{ (feature types)} \times 50 = 600$ topological features per image.
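
\noindent As an illustrative formalization (a sketch in notation introduced here; it assumes sublevel-set filtrations and increasing thresholds $t_1 < \dots < t_{50}$, whose exact spacing is not specified above), let $f_c$ denote the intensity of channel $c \in \{R, G, B, \mathrm{gray}\}$. The per-channel descriptors can then be written as
\[
X^c_{t_i} = \{\, p : f_c(p) \le t_i \,\}, \qquad
\mathbf{b}^c_k = \bigl(\beta_k(X^c_{t_1}), \dots, \beta_k(X^c_{t_{50}})\bigr), \quad k \in \{0, 1\},
\]
\[
\mathbf{a}^c = \bigl(|X^c_{t_1}|, \dots, |X^c_{t_{50}}|\bigr),
\]
where $\beta_k$ is the $k$-th Betti number of the cubical complex on the selected pixels and $|X|$ counts the activated pixels. Concatenating $\mathbf{b}^c_0$, $\mathbf{b}^c_1$, and $\mathbf{a}^c$ over the four channels gives the $4 \times 3 \times 50 = 600$ features stated above.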

\noindent \textit{Multipersistence Topological Descriptors.}  \quad
To capture richer structural interactions, we further compute cubical multiparameter persistence using a bifiltration over the red and green channels with 20 thresholds each. This yields a $20 \times 20$ bifiltration grid for each image, producing a 2D topological map with three channels corresponding to Betti-0, Betti-1, and the number of activated pixels at each grid point. When flattened, this representation provides $20 \times 20 \times 3 = 1200$ multiparameter topological features per image.
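
\noindent The bifiltration grid can be sketched as follows (an illustrative formalization in notation introduced here; it assumes sublevel sets in both channels and increasing thresholds $s_1 < \dots < s_{20}$ for red and $t_1 < \dots < t_{20}$ for green). Writing $f_R$ and $f_G$ for the red and green intensities,
\[
X_{i,j} = \{\, p : f_R(p) \le s_i \ \text{and}\ f_G(p) \le t_j \,\}, \qquad 1 \le i, j \le 20,
\]
\[
M_0[i,j] = \beta_0(X_{i,j}), \qquad M_1[i,j] = \beta_1(X_{i,j}), \qquad M_a[i,j] = |X_{i,j}|.
\]
Under this assumption the sets are nested in both directions, $X_{i,j} \subseteq X_{i',j'}$ whenever $i \le i'$ and $j \le j'$, and stacking $(M_0, M_1, M_a)$ gives the $3 \times 20 \times 20$ tensor with $1200$ entries.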

These topological features are first evaluated independently using XGBoost classifiers to assess their discriminative capability. Subsequently, the multi-parameter topological maps are integrated into the \textbf{TopoCon-MP} framework, where they are combined with Swin Transformer embeddings for supervised contrastive learning and classification.

\smallskip

\noindent \textbf{Hyperparameters.} \quad
We used the Adam optimizer with a learning rate of $10^{-4}$, a batch size of 64, and class-weighted cross-entropy loss for all experiments.
All images were resized to $224\times224$ and normalized with the ImageNet mean and standard deviation.
The baseline CNN and ViT models and our TopoCon-MP model were trained for 15 epochs without augmentation, using ImageNet-pretrained backbones with frozen weights while keeping BatchNorm statistics and Dropout layers active.
 
For the transfer learning setup (training on MILK-10K and DermaMNIST, testing on PAD-UFES-20), we used identical hyperparameters and applied early stopping based on validation macro-F1 (patience = 3).





Our contrastive fusion model (TopoCon-MP) was trained with the AdamW optimizer (weight decay of $10^{-2}$) for 30 epochs with a cosine-annealing learning-rate schedule.
The model fused Swin-T image embeddings with $3\times20\times20$ topological maps derived from the 1200-dimensional multipersistence vectors.
Each vector was normalized in 400-dimensional blocks and reshaped to a $3\times20\times20$ tensor.
The fusion module consisted of LayerNorm, a linear layer, a ReLU activation, and dropout with rate 0.3.
The projection head for contrastive learning was a two-layer MLP with dimensions $768\rightarrow256\rightarrow64$, with LayerNorm and ReLU activations, mapping fused embeddings to a 64-dimensional latent space.
We used a supervised contrastive loss with $\lambda=0.1$ and temperature $\tau=0.07$, combined with the cross-entropy loss for classification. All models used automatic mixed precision (AMP), TF32, and gradient clipping at 1.0. The best model was selected by the highest validation AUC.

Multipersistence feature extraction on DermaMNIST was performed on an HPC system using single-core CPU execution. For $224\times224$ images, computing the $3\times20\times20$ (Betti-0, Betti-1, activated pixels) tensor takes 0.229\,s per image on average, corresponding to a total preprocessing time of approximately 38 minutes for the full DermaMNIST dataset (10{,}015 images). Training TopoCon-MP with a frozen Swin-T backbone on DermaMNIST required approximately 55 minutes on a single NVIDIA A100 GPU. Our code is available at \url{https://github.com/sayoni-c98/MIDL2026-TopoConMP}.
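
\noindent For concreteness, the combined training objective described above can be sketched as follows. This is an illustrative formalization rather than a transcription of the released code: it assumes the standard supervised contrastive loss, $\ell_2$-normalized projections $z_i$ from the 64-dimensional head, and that $\lambda$ weights the contrastive term.
\[
\mathcal{L} = \mathcal{L}_{\mathrm{CE}} + \lambda\, \mathcal{L}_{\mathrm{SupCon}}, \qquad
\mathcal{L}_{\mathrm{SupCon}} = \sum_{i \in I} \frac{-1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp(z_i \cdot z_p / \tau)}{\sum_{a \in A(i)} \exp(z_i \cdot z_a / \tau)},
\]
where $I$ indexes the samples in a batch, $A(i) = I \setminus \{i\}$, and $P(i) \subseteq A(i)$ contains the samples that share the label of $i$ (anchors with empty $P(i)$ are skipped). A small temperature such as $\tau = 0.07$ sharpens the softmax over similarities, so that hard negatives contribute more to the gradient.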



\paragraph{Baselines.} %\label{sec:baselines}
We compare our approach against widely used 2D convolutional and transformer-based architectures. For convolutional baselines, we include {MobileNetV3-Large-100}~\cite{howard2019mobilenetv3}, {DenseNet121}~\cite{huang2017densenet}, {ResNet50}~\cite{he2016resnet}, and {EfficientNetV2-S}~\cite{tan2021efficientnetv2}, representing compact, densely connected, residual, and compound-scaled network families, respectively. For transformer-style baselines, we evaluate {ViT-B/16}~\cite{dosovitskiy2021vit}, {MobileViT-S}~\cite{mehta2022mobilevit}, and {Swin-T}~\cite{liu2021swin}, covering both pure and hybrid vision transformer designs. All models use ImageNet-pretrained weights and are fine-tuned on the dermoscopic datasets with identical optimization and augmentation settings.




\paragraph{Results.}
Table~\ref{tab:derma_results} reports the performance of all CNN, transformer, and TopoCon-MP models on the three dermoscopic datasets. On every dataset, TopoCon-MP attains the best AUC, accuracy, and macro-F1, and on MILK-10K and PAD-UFES-20 it also attains the best sensitivity and specificity. The AUC gains over the strongest baseline are largest on MILK-10K (94.5 vs.\ 87.7 for Swin-T) and PAD-UFES-20 (93.0 vs.\ 88.6 for Swin-T). This suggests that fusing multiparameter topological features with a supervised contrastive objective yields more discriminative representations than image-only baselines. On DermaMNIST and MILK-10K, absolute sensitivity remains modest for all methods, likely due to class imbalance and ambiguous cases at a fixed decision threshold. On DermaMNIST, the sensitivity of TopoCon-MP (59.7) is below that of the transformer baselines (61.6 to 63.0), whereas on MILK-10K it achieves the highest sensitivity (50.4) together with the best AUC, macro-F1, and specificity, indicating that its gains there are not driven by a specificity-only trade-off.



To assess cross-dataset generalization, we train all models on the five shared classes from DermaMNIST and the dermoscopic subset of MILK-10K and evaluate on the corresponding subset of PAD-UFES-20 (Table~\ref{tab:transfer}). In this transfer setting, ViT-B/16 achieves the best score on every metric, while TopoCon-MP remains competitive: its AUC is within 0.4 points of ViT-B/16 (78.5 vs.\ 78.9), it obtains the second-best accuracy, and it outperforms all CNN baselines on every metric. These results suggest that TopoCon-MP retains most of its backbone's robustness under domain shift, although it trails Swin-T on AUC, F1, sensitivity, and specificity, and our current fusion does not yet surpass the strongest transformers.


\begin{table*}[t]
   \centering
    \large
\caption{\footnotesize \textbf{Baseline comparison.} We compare TopoCon-MP with strong pretrained CNN and ViT models. The {\best{best}}, {\second{second}}, and {\third{third}} scores in each column are highlighted.}
    \label{tab:derma_results}
    \setlength\tabcolsep{4pt}
    
    \resizebox{\linewidth}{!}{%
    \begin{tabular}{lccccc||ccccc||ccccc}
        \toprule
        & \multicolumn{5}{c}{\textbf{DermaMNIST}} 
        & \multicolumn{5}{c}{\textbf{MILK-10K}} 
        & \multicolumn{5}{c}{\textbf{PAD-UFES-20}} \\[2pt]
        \cmidrule(r){2-6} \cmidrule(lr){7-11} \cmidrule(lr){12-16}
        \textbf{Model} 
& \textbf{AUC} & \textbf{Acc.} & \textbf{F1} & \textbf{Sens.} & \textbf{Spec.}
& \textbf{AUC} & \textbf{Acc.} & \textbf{F1} & \textbf{Sens.} & \textbf{Spec.}
& \textbf{AUC} & \textbf{Acc.} & \textbf{F1} & \textbf{Sens.} & \textbf{Spec.} \\
\midrule
        
        MobileNetV3 
        & 88.5 & 63.4 & 43.3 & 50.0 & 93.1
        & 81.9 & 50.6 & 29.8 & 32.3 & 94.6
        & 82.7 & 52.0 & 42.7 & 45.5 & 89.9 \\
        
        DenseNet121           
        & 87.4 & 59.4 & 38.5 & 50.2 & 93.0
        & 77.3 & 45.4 & 25.2 & 28.6 & 94.1
        & 78.2 & 47.4 & 40.8 & 44.8 & 88.6 \\

        ResNet50              
        & 84.2 & 61.1 & 38.2 & 45.9 & 92.8
        & 79.9 & 47.9 & 20.7 & 22.4 & 94.0
        & 74.2 & 50.3 & 35.0 & 35.7 & 89.2 \\

        EfficientNetV2-S      
        & 82.5 & 56.3 & 36.0 & 43.9 & 91.9
        & 73.7 & 42.9 & 27.1 & 34.0 & 93.9
        & 75.1 & 43.4 & 37.4 & 41.3 & 88.2 \\

        ViT-B/16              
        & \second{93.7} & \second{75.7} & \second{61.5} & \third{61.6} & \best{94.5}
        & 79.4 & \third{55.3} & \third{35.9} & 37.8 & 94.9
        & 87.7 & \third{66.5} & \third{56.4} & \third{55.6} & \third{93.2} \\

        MobileViT-S           
        & 89.1 & 68.9 & 54.0 & \second{61.8} & 94.3
        & \third{82.5} & 52.1 & 34.3 & \third{38.1} & \third{94.9}
        & \third{87.7} & 64.7 & 48.0 & 49.6 & 92.1 \\

        Swin-T                
        & \third{93.3} & \third{73.7} & \third{57.1} & \best{63.0} & \second{94.4}
        & \second{87.7} & \second{69.3} & \second{47.5} & \second{46.9} & \second{96.2}
        & \second{88.6} & \second{75.7} & \second{60.3} & \second{59.8} & \second{94.6} \\
        
        \midrule

        TopoCon-MP 
        & \best{94.9} & \best{79.3} & \best{62.2} & 59.7 & \third{94.3}
        & \best{94.5} & \best{71.0} & \best{48.9} & \best{50.4} & \best{96.7}
        & \best{93.0} & \best{77.8} & \best{73.0} & \best{72.0} & \best{95.0} \\
        
        \bottomrule
    \end{tabular}}%
    \vspace{-.15in}
\end{table*}

\begin{wraptable}{r}{3in}
\vspace{-.15in}
\centering
\caption{\footnotesize \textbf{Cross-dataset transfer to PAD-UFES-20.}
Models are trained on the five shared classes from DermaMNIST and the
dermoscopic subset of MILK-10K and evaluated on PAD-UFES-20 without
fine-tuning. For each metric, the \best{best}, \second{second}, and
\third{third} scores across models are highlighted.}
\label{tab:transfer}
\vspace{4pt}
\resizebox{\linewidth}{!}{%
\begin{tabular}{lccccc}
\toprule
\textbf{Model}& \textbf{AUC} & \textbf{Acc} & \textbf{F1} & \textbf{Sens} & \textbf{Spec} \\
\midrule
{MobileNetV3} 
& 66.2 & 39.3 & 30.9 & 32.5 & 82.8 \\
{DenseNet121}           
& 66.6 & 36.6 & 30.3 & 31.6 & 81.7 \\
{ResNet50}              
& 64.0 & 39.8 & 20.8 & 24.7 & 80.6 \\
{EfficientNetV2-S}      
& 62.4 & 34.9 & 27.2 & 29.3 & 82.4 \\

{ViT-B/16}              
& \best{78.9} & \best{51.3} & \best{45.5} & \best{46.8} & \best{86.2} \\

{MobileViT-S}           
& 73.8 & 41.1 & 33.9 & 38.8 & 83.8 \\

{Swin-T}                
& \second{78.7} & \third{47.6} & \second{42.4} & \second{46.3} & \second{86.1} \\
\midrule
{TopoCon-MP}
& \third{78.5} & \second{50.3} & \third{36.7} & \third{42.9} & \third{85.5} \\
\bottomrule
\end{tabular}}
\vspace{-.1in}
\end{wraptable}
\paragraph{Ablation studies.}
We conduct two ablations to disentangle the contributions of the topological representation and the fusion strategy. First, in Table~\ref{tab:TDA_ablation} we compare single-parameter (SP) cubical persistence on grayscale and on all four channels (RGB and grayscale) with our multiparameter red-and-green (MP-RG) encoding, using the same XGBoost classifier. On DermaMNIST, SP across all channels attains the best AUC and accuracy (and the best score on every other metric), indicating that a simple multi-channel filtration already captures most of the useful topology on this relatively clean benchmark. On MILK-10K, MP-RG yields the highest AUC (79.8 vs.\ 74.9), and on the more heterogeneous PAD-UFES-20 dataset it substantially improves accuracy, F1, sensitivity, and specificity over the SP variants, although its AUC (74.2) is slightly below that of SP across all channels (76.2). This suggests that MP becomes more beneficial as acquisition conditions and lesion appearance vary.

Second, Table~\ref{tab:ML_ablation} evaluates different ML classifiers on the same \(3\times20\times20\) multipersistence tensors. XGBoost provides a strong topology-only baseline, while applying a 2D CNN to these MP outputs (MP+CNN) performs worse, with lower AUC and accuracy on all three datasets, and is unstable on MILK-10K and PAD-UFES-20. In contrast, our full TopoCon-MP model (MP+SupCon), which jointly trains image and topology encoders with supervised contrastive alignment, delivers large and consistent gains in AUC, accuracy, F1, sensitivity, and specificity across all datasets. These results indicate that both the multiparameter features and the topology-aware contrastive fusion are important for the final performance.





\begin{table*}[h!]
\vspace{-.1in}
    \centering
    \large
    \caption{\footnotesize \textbf{Topological Feature Ablation.} The results of our ablation study of XGBoost models on topological features across different channels and datasets.}
    \label{tab:TDA_ablation}
    \setlength\tabcolsep{4pt}
    
    \resizebox{\linewidth}{!}{%
    \begin{tabular}{lcccccc||ccccc||ccccc}
        \toprule
       & & \multicolumn{5}{c}{\textbf{DermaMNIST}} 
        & \multicolumn{5}{c}{\textbf{MILK-10K}} 
        & \multicolumn{5}{c}{\textbf{PAD-UFES-20}} \\
        \cmidrule(r){3-7} \cmidrule(lr){8-12} \cmidrule(lr){13-17}
        \textbf{TDA model} &\textbf{\# Feat.}
        & \textbf{AUC} & \textbf{Acc.} & \textbf{F1} & \textbf{Sens.} & \textbf{Spec.}
        & \textbf{AUC} & \textbf{Acc.} & \textbf{F1} & \textbf{Sens.} & \textbf{Spec.}
        & \textbf{AUC} & \textbf{Acc.} & \textbf{F1} & \textbf{Sens.} & \textbf{Spec.} \\
        \midrule
        {SP-Grayscale} & 150 
        & 84.8 & 70.2 & 32.9 & 29.0 & 89.8
        & 72.6 & 58.9 & 19.0 & 19.2 & 93.7
        & 74.0 & 46.5 & 27.8 & 28.4 & 86.7 \\
        {SP-All channels} & 600          
        & 92.8 & 75.3 & 49.0 & 42.8 & 92.1
        & 74.9 & 59.5 & 19.6 & 19.8 & 93.8
        & 76.2 & 47.8 & 28.6 & 29.5 & 86.9 \\
        {MP-Red and Green} & 1200              
        & 87.1 & 72.0 & 31.0 & 29.2 & 90.7
        & 79.8 & 56.0 & 19.0 & 19.1 & 93.4
        & 74.2 & 57.0 & 38.0 & 39.1 & 89.3 \\
        \bottomrule
    \end{tabular}}%
    \vspace{-.1in}
\end{table*}

\begin{table*}[t]
    \centering
    \large
    \caption{\footnotesize \textbf{ML Ablation.} The performance of different ML models using our $3\times20\times20$ multipersistence outputs. The MP+SupCon row corresponds to our TopoCon-MP model.}
    \label{tab:ML_ablation}
    \setlength\tabcolsep{4pt}
    
    \resizebox{\linewidth}{!}{%
    \begin{tabular}{lccccc||ccccc||ccccc}
        \toprule
        & \multicolumn{5}{c}{\textbf{DermaMNIST}} 
        & \multicolumn{5}{c}{\textbf{MILK-10K}} 
        & \multicolumn{5}{c}{\textbf{PAD-UFES-20}} \\
        \cmidrule(r){2-6} \cmidrule(lr){7-11} \cmidrule(lr){12-16}
        \textbf{Model} 
        & \textbf{AUC} & \textbf{Acc.} & \textbf{F1} & \textbf{Sens.} & \textbf{Spec.}
        & \textbf{AUC} & \textbf{Acc.} & \textbf{F1} & \textbf{Sens.} & \textbf{Spec.}
        & \textbf{AUC} & \textbf{Acc.} & \textbf{F1} & \textbf{Sens.} & \textbf{Spec.} \\
        \midrule
        
        {MP+XGBoost} 
        & 87.1 & 72.0 & 31.0 & 29.2 & 90.7
        & 79.8 & 56.0 & 19.0 & 19.1 & 93.4
        & 74.2 & 57.0 & 38.0 & 39.1 & 89.3 \\ 
        
        {MP+CNN}
        & 79.3 & 68.5 & 18.9 & 18.5 & 87.7
        & 77.9 & 55.5 & 20.4 & 20.8 & 93.7
        & 71.9 & 50.6 & 31.8 & 33.6 & 87.5 \\
        \midrule
        {MP+SupCon} 
           & \textbf{94.9} & \textbf{79.3} & \textbf{62.2} & \textbf{59.7} & \textbf{94.3}
        & \textbf{94.5} & \textbf{71.0} & \textbf{48.9} & \textbf{50.4} & \textbf{96.7}
        & \textbf{93.0} & \textbf{77.8} & \textbf{73.0} & \textbf{72.0} & \textbf{95.0} \\
        
        \bottomrule
    \end{tabular}}%
    \vspace{-.1in}
\end{table*}

\paragraph{Limitations.} We focus on dermoscopic images, where color and lesion structure are relatively standardized. Extending the approach to other modalities with different artifacts and intensity statistics may require redesigning the filtration and preprocessing for stability. The current multipersistence grid (red and green channels with $20\times20$ thresholds) may also not be optimal. Future work will study filtration sensitivity and develop modality-specific, acquisition-robust filtrations.


\begin{wrapfigure}{r}{2.5in}
\vspace{-.1in}
\centering
\includegraphics[width=\linewidth]{figures/betti_DermaMNIST1.png}
\caption{\footnotesize \textbf{DermaMNIST Betti-0 curves.}
Mean \(\beta_0\) curves with 40\% confidence bands for DermaMNIST lesion classes
(AKIEC, BCC, BKL, MEL, NV) as a function of red-channel intensity threshold,
showing class-specific patterns in the evolution of connected components.}
\label{fig:betti-derma}
\vspace{-.35in}
\end{wrapfigure}
\paragraph{Visualization.}
To better understand what the multipersistence descriptors capture, we visualize them on MILK-10K and DermaMNIST. Appendix~\ref{app:visualization} provides two complementary views for MILK-10K: Figure~7 shows mean $\beta_0$ and $\beta_1$ Betti curves with confidence bands across color channels, while Figure~8 displays classwise median red and green Betti tensors and activated-pixel maps. Together, these plots reveal consistent, class-specific patterns in multiscale topology, for example broader peaks and shifted hotspots for melanoma and keratinocytic lesions. This provides qualitative evidence that our multipersistence features reflect clinically meaningful lesion structure rather than arbitrary handcrafted cues. Figure~\ref{fig:betti-derma} shows analogous DermaMNIST red-channel $\beta_0$ curves, revealing class-specific differences in connected-component evolution, including shifts in peak intensity thresholds as well as changes in peak magnitude and decay rate as the sublevel set grows.


\vspace{-.1in}


