\section{Method}
Stain normalization seeks to transform histopathological images such that those acquired through one preprocessing pipeline and scanner setup resemble images processed through a different pipeline.
Crucially, this transformation models only the digitization process; the underlying tissue slide remains unchanged.
% Throughout our study of stain normalization, there were two primary considerations.
% One, hallucinations and in general, instances of corruption of information are unacceptable in the field of medicine.
% This is especially true when dealing with malignancies that only show themselves in subtle ways.
% Learned transformations have a tendency to shift rare cases towards the mean.
% % \\
% Two, in practice, domain adaptation is not something that is performed only once. 
% The data adaptation process is repeated every time a shift is detected.
% For this reason, the process should be straightforward.
% % \\ 
% These two considerations, as well as the additional consideration of the cost of labeled medical data, lead to the design decisions detailed in the following sections.
Given that the most prominent inter-domain differences are typically in color, while texture and structural features originate from the tissue itself, we conceptualize stain normalization as a color flow, a deterministic function operating on pixel values.
By design, this function is spatially invariant, meaning it does not adapt its output based on local image context.
This constraint is intentional, as allowing spatially dependent transformations could introduce hallucinations, where the model fabricates plausible but incorrect features.
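
As a sketch in notation introduced here for illustration only ($x$ for the input image, $\hat{x}$ for the normalized image, $p$ for a pixel position, and $f$ for the color flow), this constraint can be written as
\[
\hat{x}(p) = f\bigl(x(p)\bigr) \quad \text{for every pixel position } p,
\]
so that two pixels with identical input colors always receive identical output colors, regardless of their neighborhoods.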
% Our results demonstrate that such non-linear, context-independent color flows can serve as effective mappings for domain adaptation.

\subsection{1\texttimes1 convolutions with limited non-linearities}\label{sec:conv}

% The goal of this study is to develop a straightforward approach for identifying a domain adaption function for histopathology. 
As in other medical fields, it is paramount in histopathology that no information is lost and none is fabricated.
What appear to be minor details in the overall picture can have a major impact on the diagnosis.
To achieve this information retention, we chose to constrain the space of functions that can be expressed with the network architecture.
Taken to the extreme, a reversible linear transformation between colors is by nature immune to hallucinations.
Of course, such a color shift alone generally cannot adequately counteract the differences between images scanned at different wet labs.
For example, a difference in staining concentration would result in a different hue for the tissue but should not affect the background.
% \\

We therefore took linear color transformations as a starting point. 
% and aim to allow just enough non-linearities in the neural network for stain normalization.
\rebutextra{Following common practices and architectures, we then incrementally explored the design space to arrive at a network that achieves good results in various settings.}
This translates to fully convolutional neural networks composed exclusively of kernels with a spatial size of $1\times1$.
The number of layers and the number of kernels per layer are kept low: up to three convolutional layers with up to $32$ kernels per convolution.
Additionally, a residual approach is taken: the output of the neural network is added to the input colors instead of serving as the final result.
This allows regularizers on the size of the network weights to keep the learned function closer to the identity.
% \\
Beneficial byproducts of these design decisions are a low memory footprint and fast execution.
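
To make the structure explicit, the following sketch writes out one possible instance with three layers.
The symbols $c$ (an RGB vector), $g_\theta$, $W_k$, $b_k$, and the pointwise non-linearity $\sigma$ are illustrative assumptions; the description above fixes only the $1\times1$ kernel size, the bounds on depth and width, and the residual connection.
Since a $1\times1$ convolution acts on each pixel as an affine map of its channel vector, the network applied to a single color reads
\[
f_\theta(c) = c + g_\theta(c), \qquad
g_\theta(c) = W_3\,\sigma\bigl(W_2\,\sigma(W_1 c + b_1) + b_2\bigr) + b_3,
\]
where, for example, $W_1$ is a $32\times3$ matrix, $W_2$ a $32\times32$ matrix, and $W_3$ a $3\times32$ matrix.
When all weights and biases are zero, $g_\theta$ vanishes and $f_\theta(c) = c$, which is why a penalty on the parameter magnitudes pulls the learned function toward the identity.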

\subsection{Color distribution dissimilarity loss}\label{sec:color_loss}

% \begin{figure}[t!]
% \centering
% \includegraphics[width=0.8\textwidth]{figures/Qualitative.png}
% \caption{
% % These images provide a qualitative comparison of an adaptation provided by 1$\times$1 Stainer and related deep learning methods (original images from \todo{red}).
% Qualitative comparison of One by One Stainer and related deep learning methods (original images from  \citep{stainnet}).
% The sample is part of the test set of the Mitos \& Atypia 14 dataset, aperio domain.
% Note the color artifacts after processing by StainGAN, which, as a CycleGAN based method, is given full freedom to generate the image.
% StainNet is trained to reproduce the output of StainGAN.
% % In contrast, our method is trained directly to resemble the target domain while constraining the modification to a non-linear color function.
% % \todo{Color difference maps?}
% } \label{fig:Qualitative}
% \end{figure}


The most noticeable gap between domains in histopathology lies in the overall hue and in how frequently each color occurs.
Closing this gap in color distributions is a necessary condition for adapting one domain to the other.
We therefore propose to incorporate the color likelihoods in the loss function to guide the optimization of the model.
In contrast to \citet{lee2022stain}, we do not introduce another neural network but derive the loss directly from the color occurrences in the target dataset and in the batches.
% Two alternative loss functions were considered for this purpose.

The earth mover's distance is a common metric for comparing distributions.
As colors are represented by vectors of size three, aggregating their occurrences in a dataset gives rise to a three-dimensional tensor.
Calculating the earth mover's distance between such tensors would be iterative, approximate and costly.
Instead, we propose to optimize an upper bound on this distance.
In one dimension, the earth mover's distance can be computed efficiently by summing the absolute values of the cumulative sum of the differences between the two histograms.
The sum of the distances computed for each color channel independently gives an upper bound on the distance between the full distributions, as the per-channel transport can be redundant and less efficient than a plan that considers all dimensions jointly.
Projecting along the RGB unit vectors is, however, a somewhat arbitrary choice.
We therefore take the mean over the upper bounds set by $N$ random orthonormal bases, sampled uniformly on the unit sphere, similar to \citet{seguy2018large}.
The projected distribution dissimilarity loss function \( L_{\text{color,proj}} \) can be defined as follows:
\begin{equation}
L_{\text{color,proj}} = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{3} \text{OT}\left(H_{\text{proj}, i, j}, H_{\text{target}, \text{proj}, i, j}\right),
\end{equation}
where \( N \) is the number of random orthonormal bases,
\( H_{\text{proj}, i, j} \) is the 1D histogram of the batch projected along the \( j \)-th vector of the \( i \)-th basis, in practice differentiably approximated as proposed by  \citet{ustinova2016learning},
\( H_{\text{target}, \text{proj}, i, j} \) is the 1D histogram of the target distribution projected along the \( j \)-th vector of the \( i \)-th basis,
and \( \text{OT} \) denotes the optimal transport cost, i.e., the earth mover's distance, between the projected histograms.
The model architecture and training procedure are illustrated in Figure~\ref{fig:architecture}.
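
For concreteness, the two ingredients of this loss can be written out as follows.
The notation is introduced here for illustration and is not taken from a reference implementation: $c$ is an RGB vector, $u_{i,j}$ the $j$-th vector of the $i$-th basis, $t_{i,j}$ the resulting scalar, $h$ and $h'$ two normalized 1D histograms over the same $B$ equally spaced bins, and $\Delta$ the bin width.
Each pixel color is reduced to a scalar by projection, and the 1D transport cost between two such histograms has a closed form in terms of cumulative sums:
\[
t_{i,j} = u_{i,j}^{\top} c, \qquad
\text{OT}(h, h') = \Delta \sum_{k=1}^{B} \Bigl| \sum_{l=1}^{k} \bigl(h_l - h'_l\bigr) \Bigr|.
\]
Choosing the standard basis for $u_{i,1}, u_{i,2}, u_{i,3}$ recovers the per-channel case, in which each RGB channel is compared separately.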
% In a 1D space, this distance can be efficiently calculated as a cumulative sum of the differences.

% where:
% \begin{itemize}
%     \item \( N \) is the number of random orthonormal bases.
%     \item \( H_{\text{proj}, i, j} \) is the 1D histogram of the batch projected along the \( j \)-th vector of the \( i \)-th basis, in practice differentiably approximated as proposed by Ustinova et al. \citep{ustinova2016learning},
%     \item \( H_{\text{target}, \text{proj}, i, j} \) is the 1D histogram of the target distribution projected along the \( j \)-th vector of the \( i \)-th basis.
%     \item \( \text{OT} \) denotes the optimal transport cost or earth mover's distance between the projected histograms.
% \end{itemize}

% A second option for the color distribution dissimilarity loss is adapted from HistoGAN \citep{afifi2021histogan}.
% It utilizes a feature borrowed from color constancy literature, a 2D differentiable histogram in log-chroma space.
% The Hellinger distance between these RGB-uv histograms of output batches and the target distribution is then calculated as loss.
% One characteristic of the RGB-uv histograms is good invariance to illumination.
% While this might be desirable for natural images, in histopathology, this does not lead to the intended results.
% To remedy this loss of information, the difference in mean intensity of the RGB color values is added to the loss.
% 
% In our experiments, both alternatives for the color loss function have led to good domain adaptation.
% A rigorous comparison remains for future work.
% In our experience, the log-chroma RGB-uv loss seems more stable with respect to hyperparameters and is the loss function that was used to obtain the results shown in the rest of this text.

% \subsection{Modifying adversarial networks}
% 
% Part of the accessibility of the proposed approach is that it only relies on unsupervised learning and does not require paired images or supervision from another model.
% While the color distribution dissimilarity loss described in section \Cref{sec:color_loss} is unsupervised, in general it is insufficient to learn a good adaptation function.
% Without any information on the structure of the content of the target images, colors in the source domain can easily get mapped to the wrong colors in the target domain.
% \\
% Another technique for unsupervised matching of outputs to a target distribution is adversarial training.
% StainGAN \citep{shaban2019staingan} applied adversarial training to stain normalization. 
% Unlike StainGAN, the proposed approach does not require a cycle consistency loss to enforce information retention.
% The architecture constraints described in section \Cref{sec:conv} above limit hallucinations by design.
% Without a cycle consistency loss, there is no need for the reverse generator, which would transform modified images back to the source domain.
% With one fewer adversary in the adversarial training, which is notorious for its instability \citep{instablegan,wassersteingp,Wiatrak2019StabilizingGA}, it follows that our training method is not as sensitive to hyperparameters and initialization.
% Furthermore, as the modifying network has far fewer parameters to fit than a typical generator, it does not need as many training samples or iterations.
% \\
% In our implementation, we use a Wasserstein GAN critic with gradient penalty \citep{wassersteingp}.
% Compared to a classifying discriminator, this adversarial loss definition offers greater stability.
% Note that the architecture restriction for the modifying network does not apply to the critic.
% On the contrary, it is important that the critic has a sufficiently wide field of view to learn to recognize the structures that can be associated with images from the target domain.
% By doing so, the adversarial loss complements the color distribution loss and guides the modifying network to have colors appear with the right frequency and in the right places.
% 
% 
% The final loss can be formulated as 
% $
% L = L_{\text{critic}} + \lambda_{\text{reg}} L_{\text{reg}} + \lambda_{\text{color}} L_{\text{color}} 
% $
% with an adversarial term derived from the critic output, a regularizer on the size of the weights, and the color distribution dissimilarity loss.

% The final loss can be formulated as below, with an adversarial term derived from the critic output, a regularizer on the size of the weights, and the color distribution dissimilarity loss.
% The model architecture and training procedure is illustrated in figure \Cref{fig:architecture}.

% \begin{equation}\label{eq:loss}
% L = L_{\text{critic}} + \lambda_{\text{reg}} L_{\text{reg}} + \lambda_{\text{color}} L_{\text{color}} 
% \end{equation}


\begin{figure*}[t!]
    \centering
    \subfigure[original]{
        \includegraphics[width=0.22\textwidth]{figures/bio1_method_comparison_crop/original.png}
    }
    \subfigure[StainGAN]{
        \includegraphics[width=0.22\textwidth]{figures/bio1_method_comparison_crop/StainGAN.png}
    }
    \subfigure[ContriMix]{
        \includegraphics[width=0.22\textwidth]{figures/bio1_method_comparison_crop/contriMix.png}
    }\\
    \subfigure[StainFuser]{
        \includegraphics[width=0.22\textwidth]{figures/bio1_method_comparison_crop/StainFuser.png}
    }
    \subfigure[StainNet]{
        \includegraphics[width=0.22\textwidth]{figures/bio1_method_comparison_crop/StainNet.png}
    }
    \subfigure[ours]{
        \includegraphics[width=0.22\textwidth]{figures/bio1_method_comparison_crop/ours.png}
    }
    \caption{Normalization by different models trained on CAMELYON16 data.
    To illustrate the risks of out-of-distribution use, the input image is taken from our proprietary dataset.
    StainGAN and StainFuser show hallucinations in the upper left corner.
    ContriMix reduces the contrast.
    StainNet maps the yellow specks to purples similar to those in the rest of the image.
    Full versions are provided in the Appendix.
    }
    \label{fig:hallucination}
\end{figure*}



% \subsection{Comparison and stopping criterion: converged discriminator}
% \subsection{Comparison and stopping: converged discriminator}
% 
% %Why do you need this?
% Generative adversarial networks are notorious for their instability, sensitivity to hyperparameters and mode collapse.
% Limiting the imitating adversary to modifications of pixels, keeping both models small, and the additional guidance from color distributions, already alleviate many of these issues.
% Still, in its purely unsupervised formulation, there is no metric to gauge success during adversarial training.
% \\
% %What is it?
% %A discriminator, as in the original GAN
% %Trained until convergence (normally not done because of gradients)
% %To form a decision boundary around the target distribution
% On an intuitive level, the reason for adversarial training is clear: observers of the output should not be able to tell whether the image was processed by the model or taken from the target domain.
% In practice, the observer of the output is often another neural network.
% The goal is then to have this downstream network treat the modified data as if it originated from the target domain.
% If the downstream network cannot differentiate the outputs from the target domain, the outputs can not be treated differently.
% If a neural network with sufficient capacity and specifically trained for the detection of target domain images cannot differentiate the outputs, it is unlikely that downstream networks will.
% % That is why, though optional, we train a discriminator to convergence on the target domain before training a 1$\times$1 Stainer model.
% That is why, though optional, we train a discriminator to convergence on the target domain before training a \ours model.
% \\
% For simplicity, the architecture of the discriminator is kept the same as the critic.
% Its task is to classify target domain images as either unmodified or color transformed by a random, linear color transformation.
% To sample the color transformations, first a uniformly random 12D unit direction vector is generated.
% Next, an amplitude is sampled from a Gaussian distribution with a given mean and standard deviation so that $0$ is at 4 deviations.
% In our experiments, $0.5$ is used as the mean.
% The direction vector scaled by the amplitude is taken as deviation from the identity transformation and reshaped as the 12 parameters of a 4$\times$4 3D affine transformation matrix.
% The modified negative training samples guide the discriminator to form a decision boundary around the target distribution.
% Finally, besides monitoring the training cross entropy, the mean Average Precision is evaluated over a holdout target domain set and source domain data, to check for convergence.
% \\
% After convergence, the discriminator is frozen. 
% % In theory the output of the discriminator could be used as a loss for the modifying network.
% % In practice, the gradients from such a loss would be too noisy as the discriminator is already trained to convergence.
% % It can however be used to evaluate when a modifying network has reached sufficient performance.
% It can be used to evaluate when a modifying network has reached sufficient performance.
% It can also be applied to compare training runs between the same domains.
% In this way, the converged discriminator provides a stopping criterion for the training of the modifying network and removes the ambiguity of adversarial training.

%How is it implemented?
%Same architecture as the critic
%Trained on target data, with half of it modified by random, linear color changes.
%Gaussian sampled around the identity
