\section{Results} \label{sec:experiments}


The proposed method is evaluated on four datasets: the Mitos \& Atypia 14 dataset \citep{Roux2014}, the CAMELYON16 dataset \citep{bejnordi2017diagnostic}, the MIDOG 2021 dataset \citep{midog}, and a proprietary, real-world dataset, see Appendix~\ref{sec:appendix_dataset}.
The first contains coregistered images from two different domains.
This allows the computation of several image-to-image metrics.
Downstream classification is evaluated on the second dataset.
The proprietary dataset was constructed to investigate occurrences of hallucinations.
% The latter dataset was constructed to test domain generalization. 
% % To show applicability beyond Mitos \& Atypia 14, in Figure \Cref{fig:midog}, we show qualitative results on two of the domains of MIDOG 2021.
% Finally, in Figure \Cref{fig:midog} in the appendix, we show qualitative results on two of the domains of MIDOG 2021.
Figure~\ref{fig:retrained} and Figure~\ref{fig:hallucination} show images from our proprietary dataset.
They illustrate some of the risks involved with the use of state-of-the-art stain normalization methods.
For Figure~\ref{fig:retrained}, new models were trained on the proprietary data, using the code and hyperparameters provided by the original authors.
Despite this, the StainGAN and StainNet models display clinically significant hallucinations.
There is a clear reduction of identifiable nuclei and, overall, a shift toward an appearance in line with treatment effect. 
The count of so-called ghost cells is not preserved from input to output images, which could profoundly mislead pathologists and alter the grading and interpretation from benign to necrotic.

\begin{table*}[!htbp]
\centering
\caption{
Paired image metrics for the test set of the Mitos \& Atypia 14 dataset.
The $\mathbf{\Delta}$ columns list the improvements over applying the identity transformation.
% , i.e. the similarity between source and target images.
% For the Kullback-Leibler divergence over the colors in the column on the right, lower values are better.
The column on the right is the Kullback-Leibler divergence over the colors.
It does not list deviations as the divergence between the output and the target distribution is a single value.
The deviations in the SSIM and PSNR columns indicate that there is no clear best method for these metrics, as the error margins overlap for all approaches. 
The SSIM$_{src}$ is a measure of the preservation of information from the source image, a crucial property for medical diagnosis \rebutextra{(see Section~\ref{sec:stain_transfer_results}).}
% As apparent from the SSIM source column, 1$\times$1 Stainer preserves the structures of the input image.
% It does so by matching the colors of the target domain (low Color KL).
% However, as a color-to-color function, its output distribution is strongly influenced by the occurrences of colors in the source domain.
% StainGAN and other generative methods do not have this constraint.
% \todo{Delta values for non-DL}
}\label{tab:results}
\fontsize{10pt}{12pt}\selectfont
\begin{tabular}{lcccccc}
\hline
\textbf{Method} & \textbf{SSIM ($\uparrow$)} & $\mathbf{\Delta}$ \textbf{SSIM}  & \textbf{PSNR ($\uparrow$)}& $\mathbf{\Delta}$ \textbf{PSNR} & \textbf{SSIM$_{src}$} & \textbf{KL ($\downarrow$)} \\ 
\hline
source & .641 ± .108 & 0.0 & 20.3 ± 3.2 & 0.0 & 1.0 & 1.85\\
\hline
Reinhard & .617 ± .106 & .001 ± .028 & 19.9 ± 2.1 & 2.16 ± 2.65 & .964 ± .031 & 0.528\\ 
Macenko & .656 ± .115 & .026 ± .019 & 20.7 ± 2.7 & 1.57 ± .91 & \textbf{.966 ± .049} & 1.31\\ 
Vahadane & .664 ± .116 & .029 ± .022 & 21.1 ± 2.8 & 1.81 ± .93 & \textbf{.967 ± .046} & 1.70\\ 
StainDiff$^a$ & .721 ± .017 &&&&\\ 
StainGAN & .706 ± .099 & \textbf{.061 ± .028} & 22.4 ± 2.6 & 2.11 ± 1.24 & .883 ± .025 & 0.095\\ 
StainNet & .691 ± .107 & \textbf{.050 ± .013} & 22.5 ± 3.3 & 2.22 ± .65 & .957 ± .007 & 1.04\\ 
\hline
1$\times$1 Stainer & .651 ± .107 & .010 ± .005 & 22.0 ± 3.4 & 1.73 ± .58 & \textbf{.997 ± .0003} & 0.200\\
% \todo{No color loss} & & & & & &\\
% 115 param& .644 ± .106 & .003 ± .009 & 20.4 ± 2.0 & .10 ± 1.46 & .997 ± .0004 &\\
\hline
\multicolumn{7}{l}{$^a$Results are taken from the original paper.}

\end{tabular}
\end{table*}

\begin{table*}[h]
\centering
\caption{
The performance of a set of pretrained tumor-normal classifiers on the test set of \\CAMELYON16 after input normalization.
% The networks are trained for tumor-normal classification on the target domain.
For testing, the input images originate from the source domain, transformed by the listed methods.
AUC$_{retro}$ shows the performance when the normalization is retroactively applied to the classifier that performed best in the source domain.
AUC$_{best}$ lists the AUC of the best observed classifier.
}\label{tab:downstream}
\fontsize{10pt}{12pt}\selectfont
\begin{tabular}{lccccc}
\hline
\textbf{Method} & \textbf{Precision} & \textbf{Recall}  & \textbf{AUC} & \textbf{AUC$_{retro}$} & \textbf{AUC$_{best}$}\\ 
\hline
source & .651 ± .032 & .974 ± .013 & .723 ± .034& .794 & .794\\
source augmented & .829 ± .023 & .820 ±  .033& .824 ± .011& NA & .849\\
% \rebutextra{no data shift}& .915 ± .002 & .900 ±  .003& .908 ± .002& NA & .913\\
\hline
Vahadane & .892 ± .026 & .746 ±  .045& .827 ± .011& .823 & .847\\
StainGAN & .902 ± .027 & \textbf{.879 ± .020}& \textbf{.890 ± .011}& .890 & .897\\
StainNet & \textbf{.952 ± .020} & .813 ± .026& \textbf{.885 ± .005}& .880 & .893\\
ContriMix & .807 ± .030 & \textbf{.865 ± .031}& .828 ± .011& .835 & .842\\
\hline
1$\times$1 Stainer (ours)& \textbf{.937 ± .016} & .812 ± .023 & \textbf{.878 ± .004} & .880 & .898\\
\hline
\end{tabular}
\end{table*}


\subsection{Stain transfer results}\label{sec:stain_transfer_results}
Table~\ref{tab:results} lists image similarity results for several histopathology domain adaptation methods.
The numbers for Reinhard, Macenko, and Vahadane are as reported by the authors of StainNet.
To evaluate the performance, the similarity is measured between output images and the matching images from the target domain.
% Structured Similarity Index Metric (SSIM) and Peak Signal-to-Noise Ratio (PSNR) are included.
To counter the high variance over the images in the test set, the improvement over doing nothing, i.e. the similarity between source and target images, is also reported under $\mathbf{\Delta}$ SSIM and $\mathbf{\Delta}$ PSNR.
As was done for StainNet, we list SSIM$_{src}$ as a measure of the preservation of source image texture.
It is calculated as the SSIM between the input image in grayscale and the output image in grayscale.
The final column on the right shows how well the target color distribution is matched.
For this, the RGB colors are grouped into $32\times32\times32$ bins, each containing $8\times8\times8$ colors, and the Kullback-Leibler divergence is calculated between the binned color distribution over the output images and that over the target images.
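As a compact summary of the metrics above, the reported quantities can be written out under notation introduced here for illustration only: $x_i$ is a source image, $y_i$ its paired target image, $f$ the normalization method, $\mathrm{gray}(\cdot)$ the grayscale conversion, and $N$ the number of test pairs.
Averaging per-image differences is an assumption about how the $\mathbf{\Delta}$ columns are aggregated.
\[
\Delta\mathrm{SSIM} = \frac{1}{N}\sum_{i=1}^{N}\Bigl[\mathrm{SSIM}\bigl(f(x_i), y_i\bigr) - \mathrm{SSIM}\bigl(x_i, y_i\bigr)\Bigr],
\]
\[
\mathrm{SSIM}_{src} = \frac{1}{N}\sum_{i=1}^{N}\mathrm{SSIM}\bigl(\mathrm{gray}(x_i), \mathrm{gray}(f(x_i))\bigr),
\]
with $\Delta$PSNR defined analogously to $\Delta$SSIM.
For the color column, let $p_b$ and $q_b$ denote the fraction of pixels that fall in bin $b$ for the target images and the output images, respectively.
Assuming the divergence is taken from the target to the output distribution (the direction is not fixed by the description above), it reads
\[
D_{\mathrm{KL}}(p \,\|\, q) = \sum_{b=1}^{32^3} p_b \log\frac{p_b}{q_b}.
\]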
% \\

Our method is competitive with others with respect to similarity.
\rebutextra{A number of samples are included in the appendix for qualitative comparison, see Figure~\ref{fig:appendix_many_methods_nozooms}.}
Note that paired pathology data always involves two physical acts of scanning, resulting in differing artifacts, such as focal planes and blur.
It may not be desirable to perfectly reproduce these variations.
\rebutextra{
On the other hand, if the input image suffers from artifacts, our method will not cover up its poor quality
(Figure~\ref{fig:qual_three_blocks} in the appendix). 
}
Transformations that do introduce or remove such artifacts could lead to hallucinations.
% In any case, as per design, a color mapping function is not capable of recreating structural differences.
% On the other hand, as apparent from the SSIM source column, \ours preserves the structures of the input image.
% It does so by matching the colors of the target domain.
% However, as a color-to-color function, its output distribution is strongly influenced by the occurrences of colors in the source domain.
% StainGAN and other generative methods do not have this constraint.
\ours matches the target color distribution well, though not as closely as StainGAN, which is not restricted to color mappings and therefore not as influenced by the color occurrences in the source domain.
To further illustrate the color matching, Figure~\ref{fig:histograms} in the appendix compares RGB histograms.
% To further illustrate the color matching, Figure~\ref{fig:histograms} compares the RGB histograms of StainNet, our method and the target domain.
Finally, SSIM$_{src}$ shows that \ours outperforms the other methods in terms of structural preservation, or absence of hallucinations.



\subsection{Downstream tumor classification}
Following evaluations described in StainNet, we train a normalization model on CAMELYON16.
% Next, we train $20$ SqueezeNet models \citep{iandola2016squeezenet} to classify images from the train set of the target domain, collected at Radboud University Medical Center.
% The images have either the label `normal' or `tumor'.
% Finally, we evaluate the classifiers on the test set images from the source domain, from University Medical Center Utrecht.
% Table~\ref{tab:downstream} shows the performance gains when these test images are first normalized.
% The SqueezeNet architecture is not the most advanced solution but suffices to compare the benefits of normalization.
We then evaluate the effect on a set of $20$ classifiers pretrained on the source train set.
\rebutextra{Vahadane \citep{vahadane2016structure} is taken as a representative of the non-deep-learning approaches, as previous work has shown its superiority over others in downstream classification \citep{shaban2019staingan}.}
The row `source' is the baseline without normalization.
As another baseline, `source augmented' contrasts normalization with applying color jitter during the training of the classifiers.
AUC$_{retro}$ reflects a realistic deployment scenario: it measures the performance improvement when normalization is retroactively applied to the best-performing classifier trained on unnormalized source data.
% This is particularly relevant when retraining is impractical or when working with third-party models.
AUC$_{best}$, on the other hand, represents the upper bound of performance observed in our experiments. 
It lists the best-performing classifier for each normalization method.
Our model is among the best-performing methods in our tests.
It is competitive with StainGAN and StainNet, with overlapping error margins.
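To make the two summary columns explicit, one reading of the definitions above is the following.
The notation is introduced here for illustration only: $c_1,\dots,c_{20}$ are the classifiers, $f$ is a normalization method, $\mathrm{id}$ is the identity (no normalization), and $\mathrm{AUC}(c \circ f)$ is the AUC of classifier $c$ on test images normalized by $f$.
\[
k^\ast = \arg\max_{k} \mathrm{AUC}(c_k \circ \mathrm{id}), \qquad
\mathrm{AUC}_{retro}(f) = \mathrm{AUC}(c_{k^\ast} \circ f),
\]
\[
\mathrm{AUC}_{best}(f) = \max_{k} \mathrm{AUC}(c_k \circ f).
\]
Under this reading, $\mathrm{AUC}_{retro} \le \mathrm{AUC}_{best}$ holds for every method, and the two coincide for the `source' row, as is the case in Table~\ref{tab:downstream}.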

\rebutextra{
Visual inspection of samples that are misclassified after normalization by our method, but correctly classified after normalization by StainGAN or StainNet, revealed a pattern.
Many of these source images contain unnatural colors caused by scanning artifacts (Figure~\ref{fig:qual_three_blocks} in the appendix).
By design, our method retains the rarity of these colors, and this might have led to the confusion of the classifier.
By contrast, StainGAN and StainNet paint over these anomalous pixels with common colors.
In doing so, they suffer from color information loss.
Although it is artifacts that are modified in these cases, stain normalization and domain adaptation models are, in our opinion, not trained to distinguish between artifacts and rare, clinically relevant colors.
The responsibility of quality assurance is better left to dedicated models and expert human supervision.
The potential marginal downstream gains from allowing normalization models to modify anomalous pixels and patches to what is expected do not outweigh the risk of masking out rare, semantically relevant information.
}

\subsection{Quantitative measures of hallucination}
Structure Discrepancy \citep{moens2026hallucinates} is a recently proposed method to quantify hallucinations in histopathology images.
The measure is based on edge detection, and high values suggest that a structure was removed or inserted.
Although it cannot be perfectly aligned with medically salient hallucinations, especially at lower values (see Table~\ref{tab:structure}), it is effective at finding examples of unwanted modifications.
\rebutextra{Due to the computational and time costs of retraining methods on large, challenging datasets, we are forced to limit the comparison to StainGAN and StainNet. 
A more extensive quantitative review of hallucinations in state-of-the-art stain normalization is left for future work.}
Figure~\ref{fig:structure_discrepancy} shows that our model does not produce outlier discrepancies, with lower observed maximum values in our 24k test set.
This provides empirical evidence that our method avoids introducing rare yet clinically significant hallucinations.

To quantify the retention of infrequent colors discussed in Section~\ref{sec:infrequent}, we compared the relative occurrences of colors in the source domain to their occurrences in the target domain after normalization.
This was done uniformly over all RGB colors, to give as much weight to an outlier color as to an expected one.
In Figure~\ref{fig:infrequent_colors}, the values for our method are centered on and tightly concentrated around zero.
On the other hand, StainNet tends to map infrequent colors to more common ones, resulting in a loss of information.
Given their similar construction (see Appendix~\ref{sec:appendix_implementation}), this suggests that our method retains the advantages of StainNet while improving on hallucination resilience.
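
The comparison of color rarity described above can be formalized in more than one way.
One possibility, stated here as an assumption and not necessarily the exact quantity plotted, uses $p_{src}(c)$ for the relative frequency of color $c$ in the source domain, $p_{out}(c)$ for its relative frequency among the normalized outputs, and $f$ for the color mapping.
The shift in rarity of a source color $c$ is then
\[
d(c) = p_{out}\bigl(f(c)\bigr) - p_{src}(c),
\]
evaluated with equal weight for every RGB color $c$.
A value $d(c) > 0$ means that $c$ is mapped to a color that is more common than $c$ was, whereas $d(c) \approx 0$ means that its rarity is retained.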


\begin{figure*}
\centering
\subfigure[Hallucinations are hard to define. Structure Discrepancy is a measure to identify potential hallucinations. See Table~\ref{tab:structure} in the Appendix for examples. Importantly, this experiment indicates a risk of higher discrepancies with StainGAN and StainNet.]{
    \includegraphics[width=.45\linewidth]{figures/vs_staingan_stainnet.png}
    \label{fig:structure_discrepancy}
}
\subfigure[These box plots show how color rarity shifts after normalization. A positive value means that an infrequent color is mapped to a more common one. \newline StainGAN cannot be included in this comparison, as it requires the surrounding pixels. \newline\nolinebreak{1$\times$1 w/o skip} is an ablation.]{
    \includegraphics[width=0.46\linewidth]{figures/signed_color_bin_differences_comparison.png}
    \label{fig:infrequent_colors}
}
\caption{Quantitative results for hallucination resilience and infrequent color retention.} 
\label{fig:boxplots}
\end{figure*}



\section{\rebutextra{Ablation studies}}

\begin{table*}[h]
\centering
\caption{
\rebutextra{Ablation study results on CAMELYON16 downstream classification. 
Rows compare the impact of weight regularization (w/o reg), residual skip connection from input to output (w/o skip), and network depth (1 layer, 6 layer) against the proposed 3-layer architecture.}
}\label{tab:ablation}
\fontsize{10pt}{12pt}\selectfont
\begin{tabular}{lccccc}
\hline
\textbf{Method} & \textbf{Precision} & \textbf{Recall}  & \textbf{AUC} & \textbf{AUC$_{retro}$} & \textbf{AUC$_{best}$}\\ 
\hline
1$\times$1 Stainer (ours)& \textbf{.937 ± .016} & .812 ± .023 & \textbf{.878 ± .004} & .880 & .898\\
1$\times$1 Stainer w/o reg& .891 ± .015 & .774 ± .021 & .839 ± .006 & .842 & .852\\
1$\times$1 Stainer w/o skip& .854 ± .017 & .797 ± .018 & .830 ± .009 & .833 & .844\\
\rebutextra{1$\times$1 Stainer 1 layer}& \textbf{.944 ± .015} & .789 ± .030 & .870 ± .008 & .873 & .892\\
\rebutextra{1$\times$1 Stainer 6 layer}& \textbf{.969 ± .010} & .738 ± .036 & .857 ± .013 & .853 & .906\\
\hline
\end{tabular}
\end{table*}

\rebutextra{
To better understand the contributions of different components in our method, we conduct ablation studies examining the impact of weight regularization, network depth, and training data distribution on downstream classification performance.
}

\subsection{\rebutextra{Weight regularization}}
As an ablation, we trained our model without regularization on the network weights.
Only the color distribution dissimilarity was used as the loss function.
Without this regularization, we expect the learned color mapping to be less smooth and differ more from the identity mapping.
The impact on downstream classification is shown in Table~\ref{tab:ablation}.
The performance of the model without regularization dropped compared to the regularized version, as seen in the lower precision, recall, and AUC values. 
However, it remains on par with the source augmented baseline.
When the model is additionally trained without the skip connection, convergence becomes less stable.
The training process then risks ending in local minima with mappings akin to color inversions.
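
The interplay of the two components can be seen from a schematic form of the objective.
The symbols below are illustrative assumptions: $g_\theta$ is the learned per-pixel color network, $D$ is the color distribution dissimilarity, $\lambda$ is the regularization strength, and the squared $\ell_2$ penalty stands in for whichever weight regularizer is actually used.
\[
f_\theta(x) = x + g_\theta(x), \qquad
\mathcal{L}(\theta) = D\bigl(p_{f_\theta(\mathrm{source})},\, p_{\mathrm{target}}\bigr) + \lambda \|\theta\|_2^2 .
\]
For a network whose output vanishes together with its parameters, $g_\theta \to 0$ as $\theta \to 0$, so $f_\theta$ approaches the identity.
In this sketch, the penalty therefore pulls the mapping toward the identity only in the presence of the skip connection.
Setting $\lambda = 0$ corresponds to training without regularization, and replacing $f_\theta(x)$ by $g_\theta(x)$ corresponds to removing the skip connection.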

\subsection{\rebutextra{Network depth}}
\rebutextra{
To investigate the impact of network depth on performance, we trained variants with 1 layer and 6 layers, compared to the default 3-layer architecture.
The results are summarized in Table~\ref{tab:ablation}.
% The 1-layer model achieves competitive AUC (.870 ± .008) with high precision (.944 ± .015) but lower recall (.789 ± .030).
% The 6-layer model exhibits the highest precision (.969 ± .010) but suffers from reduced recall (.738 ± .036) and a lower AUC (.857 ± .013).
% This suggests that deeper networks tend to be more conservative, achieving higher precision at the cost of recall.
% The 3-layer architecture provides the best balance, achieving .878 ± .004 AUC with .937 ± .016 precision and .812 ± .023 recall.
% While deeper networks can learn more complex color mappings, they risk overfitting to the training distribution, resulting in less robust generalization.
}

\subsection{\rebutextra{Training data distribution}}\label{sec:ablation_data_distribution}
\rebutextra{
The convergence of our color distribution matching depends on the content of the source and target datasets used during training.
It works best when the same physical true colors are digitized just as often in both domains.
For example, this condition is trivially satisfied in a dataset of paired images.
The weight decay regularization incentivizes smooth mappings close to the identity.
This ensures that the occurrences do not need to match exactly.
}
\par
\rebutextra{
A setting in which a mismatch in colors could occur is when the source and target domains contain different tissue types.
To test the applicability in such a scenario, we trained a model
on a heterogeneous set of tissue slides from our proprietary data to normalize to the Mitos \& Atypia 14 target domain for qualitative evaluation.
The set contained 34 brain slides, 1030 lung slides, 514 pancreas slides, 287 skin slides, and 842 uterus slides, and, importantly, did not include any breast tissue slides.
A few random examples of the diversity are shown in Figure~\ref{fig:heterogeneous_train_source} in the appendix.
Qualitative comparisons to test target patches are included in Figure~\ref{fig:appendix_heterogeneous_eval}.
Even in this challenging setting, our method retains its high SSIM$_{src}$ of .997 and has a PSNR of 21.3 ± 3.1, still with largely overlapping error margins (cf. Table~\ref{tab:results}).
It should be noted that GAN- or diffusion-based models would, without strong conditioning, most likely fail in this setting.
Even with cycle consistency, the differences in image content between the domains would incentivize such models to alter the content of inputs.
}
