\section{Experiments and Results}
% Intro 
In this section, we first introduce the dataset, artifacts, and backbones. Second, we evaluate artifact detection methods on each artifact type. Next, we quantify the restoration quality of our model using both ground truth artifact masks and our own localization maps. The decisive criterion for artifact restoration is its applicability in the clinical workflow of computational pathology. We therefore evaluate a state-of-the-art downstream model on artifact images with and without HARP, and conduct a user study to determine whether pathologists can distinguish clean images from the outputs of HARP.

\textbf{Dataset and Artifacts:}
We evaluate all methods on the Breast Cancer Semantic Segmentation dataset (BCSS)~\cite{amgad2019structured}, for which we use the FCN8 architecture proposed in the same paper for the downstream task evaluation. While BCSS contains a multitude of labels, we focus on the four predominant classes (\% of labels): tumor ($45\%$), stroma ($17\%$), lymphocyte-rich tissue ($35\%$) and necrosis ($3\%$). We train our inpainting denoising diffusion model and the downstream task model on 11075 training and 1031 validation patches, and withhold 1000 test patches. The diffusion model backbone is based on the guided diffusion model from \url{github.com/Janspiry/Palette-Image-to-Image-Diffusion-Models}. From each region of interest within a WSI, we sample random crops of size $600\times600$ at 0.24 mpp, which we resample to $256\times256$. We train our models on an NVIDIA RTX 4090. Our code is available at \url{github.com/MECLabTUDA/HARP}. % TODO get numbers
Building on previous work~\cite{stieber2022FrOoDo}, we adopt the following artifacts: \textbf{dark spots}, \textbf{fat drop}, \textbf{squamous epithelia}, \textbf{threads}, \textbf{blood cell}, \textbf{blood group}, \textbf{compression}, \textbf{cuts}, \textbf{overlap}, and \textbf{folding}, which have been shown to be realistic and detrimental to downstream performance~\cite{wang2021stress,babendererde2023jointly}. We generate 100 samples for each artifact.
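As a rough check on the patch sampling described above, the effective resolution of the resampled patches can be worked out as follows. This assumes that mpp denotes microns per pixel and that resampling preserves the field of view; both are our reading and are not stated explicitly.
\[
600\,\mathrm{px} \times 0.24\,\mu\mathrm{m/px} = 144\,\mu\mathrm{m},
\qquad
\frac{144\,\mu\mathrm{m}}{256\,\mathrm{px}} \approx 0.56\,\mu\mathrm{m/px}.
\]
Under these assumptions, each $256\times256$ patch covers a field of view of roughly $144\,\mu\mathrm{m}$ per side.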

\textbf{Artifact Detection:}
The first step toward deploying artifact restoration efficiently in a clinical workflow is to detect artifacts reliably. We evaluate various methods from Anomalib~\cite{akcay2022anomalib} on the four local artifacts from~\citet{schomig2021quality}, which can be found in the appendix. From these methods, we select the three best, DRÆM~\cite{zavrtanik2021draem}, FastFlow~\cite{yu2021fastflow}, and STFPM~\cite{wang2021student}, and use the validation set to determine, for each method, the quantile at which it retains $95\%$ of normal samples (dashed lines). Figure~\ref{fig:ArtifactDetection} shows how this threshold affects the detection of all artifact types. We select FastFlow for the pipeline, as it misses most artifacts only for squamous epithelia and achieves the best accuracy of $85.8\%$. DRÆM and STFPM are strong competitors. DRÆM misses most of the artifacts for squamous epithelia, cut, and overlap, with an average accuracy of $83.9\%$. STFPM, on the other hand, misses squamous epithelia, thread, and cut artifacts, with an average accuracy of $74.9\%$.
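The thresholding step above can be written compactly. The following is a sketch in our own notation, which is an assumption and is not taken from the cited methods: let $s(x)$ denote the image-level anomaly score that a detector assigns to a patch $x$, and let $\mathcal{V}$ be the set of normal validation patches. The threshold $\tau$ is the $0.95$-quantile of the validation scores, and a patch is flagged as containing an artifact if its score exceeds $\tau$:
\[
\tau = Q_{0.95}\bigl(\{\, s(x) : x \in \mathcal{V} \,\}\bigr),
\qquad
\hat{y}(x) = \mathbf{1}\bigl[\, s(x) > \tau \,\bigr].
\]
By construction, about $5\%$ of the normal validation samples are flagged as false positives, which fixes the operating point at which the detectors are compared.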
\begin{figure}[t]
    \centering
    \includegraphics[width=0.75\textwidth]{figures/artifact_ADresults.png}
    \caption{\textbf{Evaluation of Artifact Detection per Artifact Type}} 
    \label{fig:ArtifactDetection}
\end{figure}

\begin{table}[b]
\centering
\caption{Image quality metrics for the artifact restoration of the first five artifacts.}
\begin{adjustbox}{max width=\linewidth}
  {\begin{tabular}{|l|c|c|c|c|c|c|c|c|c|c|}
  \hline
   \bfseries Artifact:  & \multicolumn{2}{c|}{\bfseries Dark Spot} & \multicolumn{2}{c|}{\bfseries Squamous Epi.}    & \multicolumn{2}{c|}{\bfseries Thread} & \multicolumn{2}{c|}{\bfseries Blood Cells} & \multicolumn{2}{c|}{\bfseries Blood Group}\\\hline
   \bfseries Metric:  & FID$\downarrow$ & PSNR$\uparrow$ & FID$\downarrow$ & PSNR$\uparrow$   & FID$\downarrow$ & PSNR$\uparrow$ & FID$\downarrow$ & PSNR$\uparrow$  & FID$\downarrow$ & PSNR$\uparrow$ \\\hline
    \multicolumn{11}{|c|}{\bfseries Supervised methods with Ground Truth Masks}\\\hline
    \bfseries ArtiFusion    & 78.3          & 19.6          & 25.7          & \textbf{24.4} & 40.4          & 21.3          & 48.3          & 22.7          & 54.1          & 20.7          \\\hline
    \bfseries AR-CycleGAN   & 98.4          & 17.0          & 77.8          & 19.4          & 114.2         & 17.7          & 179.3         & 17.3          & 130.6         & 16.9          \\\hline
    \bfseries DDRM          & 85.8          & \textbf{20.6} & 38.1          & 24.2          & 51.5          & \textbf{21.4} & 65.6          & 21.8          & 61.9          & 20.8          \\\hline
    \bfseries Ours          & \textbf{70.3} & 18.9          & \textbf{25.2} & 24.1          & \textbf{34.2} & 21.2          & \textbf{33.8} & \textbf{23.7} & \textbf{40.1} & \textbf{21.1} \\\hline
    \multicolumn{11}{|c|}{\bfseries Unsupervised method without Ground Truth Masks}\\\hline
    \bfseries HARP (Ours)   & \textbf{64.3} & 17.2          & 50.2          & 20.6          & 44.7          & 19.4          & 192.0         & 16.3          & 86.4          & 16.8          \\\hline
  \end{tabular}}
\end{adjustbox}
\label{tab:ArtifactRestoration1}
\end{table}

\textbf{Artifact Restoration Quality:}
We next assess the image quality produced by the artifact restoration model and by the full HARP pipeline. As HARP is, to the best of our knowledge, the first fully unsupervised method, we compare against three methods that either \textbf{require artifact images during training}, like AR-CycleGAN~\cite{ke2023artifact}, or \textbf{require manual input in the form of localization masks}, like ArtiFusion~\cite{he2023artifact} and DDRM~\cite{kawar2022denoising}.
To provide localizations reliably to all models, we leverage the ground truth segmentation masks of the artifacts from~\citet{stieber2022FrOoDo}. We train ArtiFusion and DDRM on BCSS with the same backbone architecture as our model. Further, as \textbf{AR-CycleGAN requires a dataset representing artifacts}, we use its original dataset and apply the resulting model to our test cases. We evaluate 100 artifact images per artifact type in Tables~\ref{tab:ArtifactRestoration1} and~\ref{tab:ArtifactRestoration2} using the Fréchet Inception Distance (FID $\downarrow$) and the Peak Signal-to-Noise Ratio (PSNR $\uparrow$), where the arrow indicates the direction of improvement. Our artifact restoration model performs best on \textbf{13 out of 20 results}, while DDRM offers minimal improvements on 6 of the metrics and ArtiFusion on 1. Ours also has a \textbf{reduced runtime of 18.6 sec.} per image vs. 30.9 sec. for ArtiFusion and 37.4 sec. for DDRM. AR-CycleGAN fails to generalize to the unseen artifact domains, which illustrates the limits of supervised training. Our fully unsupervised HARP method even improves the FID on the dark spot artifact, which is likely due to a better mask. Measured against the ground truth masks, HARP's artifact localizations achieve a DICE score of $54.5\%$. HARP's limitations appear when the localizations are suboptimal, e.g., for blood cells and blood groups. This is likely due to one of three reasons. First, the artifact detection failed. Second, these artifacts are often spread out over the whole image, unlike the other artifacts. Third, the training set likely contains some artifacts, leading to a reproduction of the same artifacts in the image. However, when the segmentation mask is appropriate, the results are excellent, as the supervised results show. The key area for improvement therefore lies in the quality of the unsupervised localization masks, as they contribute significantly to the artifact restoration process.
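For reference, both metrics follow their standard definitions; the notation below is ours and is introduced only for illustration. For a restored image $\hat{x}$ and its clean counterpart $x$ with $N$ pixel values and maximum intensity $I_{\max}$ (e.g., $255$ for 8-bit images),
\[
\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N} \bigl(x_i - \hat{x}_i\bigr)^2,
\qquad
\mathrm{PSNR} = 10 \log_{10} \frac{I_{\max}^2}{\mathrm{MSE}}.
\]
FID compares Gaussian fits to the Inception features of the real ($r$) and generated ($g$) image sets, with means $\mu_r, \mu_g$ and covariances $\Sigma_r, \Sigma_g$:
\[
\mathrm{FID} = \| \mu_r - \mu_g \|_2^2 + \mathrm{Tr}\bigl(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\bigr).
\]
PSNR is a per-image, pixelwise measure, whereas FID is a distributional measure over feature statistics, which is one reason the two metrics can rank methods differently.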

\begin{table}[t]
\centering
\caption{Image quality metrics for the artifact restoration of the last five artifacts.}
\begin{adjustbox}{max width=\linewidth}
  {\begin{tabular}{|l|c|c|c|c|c|c|c|c|c|c|}
  \hline
   \bfseries Artifact:  & \multicolumn{2}{c|}{\bfseries Compression} & \multicolumn{2}{c|}{\bfseries Cut}    & \multicolumn{2}{c|}{\bfseries Air Bubble} & \multicolumn{2}{c|}{\bfseries Overlap} & \multicolumn{2}{c|}{\bfseries Folding}\\\hline
   \bfseries Metric:  & FID$\downarrow$ & PSNR$\uparrow$ & FID$\downarrow$ & PSNR$\uparrow$   & FID$\downarrow$ & PSNR$\uparrow$ & FID$\downarrow$ & PSNR$\uparrow$  & FID$\downarrow$ & PSNR$\uparrow$ \\\hline
    \multicolumn{11}{|c|}{\bfseries Supervised methods with Ground Truth Masks}\\\hline
    \bfseries ArtiFusion    & 44.5          & 22.2          & 46.1          & 21.9          & 54.9          & 19.0          & 29.5          & 23.2          & 41.3          & 20.8          \\\hline
    \bfseries AR-CycleGAN   & 129.9         & 18.4          & 124.6         & 17.5          & 145.8         & 17.5          & 86.4          & 17.9          & 125.9         & 16.8          \\\hline
    \bfseries DDRM          & 52.8          & \textbf{22.5} & 53.2          & \textbf{22.3} & 69.1          & \textbf{19.4} & 41.0          & 23.1          & 50.8          & \textbf{20.9} \\\hline
    \bfseries Ours          & \textbf{38.1} & 22.4          & \textbf{43.9} & 22.0          & \textbf{44.9} & 18.7          & \textbf{26.7} & \textbf{23.5} & \textbf{35.4} & 20.4          \\\hline
    \multicolumn{11}{|c|}{\bfseries Unsupervised method without Ground Truth Masks}\\\hline
    \bfseries HARP (Ours)   & 65.3          & 17.9          & 60.5          & 19.0          & 66.3          & 17.1          & 51.6          & 18.0          & 69.6          & 17.4 \\\hline
  \end{tabular}}
\end{adjustbox}
\label{tab:ArtifactRestoration2}
\end{table}

\textbf{Downstream Application:}
To evaluate the usability of HARP in the clinical workflow of computational pathology, we measure the segmentation performance of the downstream model on clean images, artifact images, artifact images excluding artifact segmentations, and images restored with HARP. All variants share the same underlying image and segmentation mask from the test set, for which we calculate the DICE score per class and the average. It is important to note that AI-generated images pose a risk to accurate diagnoses, similar to undisclosed deepfakes; we therefore ensure transparency in our process by excluding the generated content using HARP's artifact localization. As seen in Table~\ref{tab:ArtifactDownstream}, performance on the artifact images decreases considerably compared to the original clean images. HARP recovers the artifact images and reduces the performance drop introduced by the artifacts by $48\%$, which makes the downstream model more robust and reliable for the daily clinical workflow.
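For clarity, the per-class DICE score used above has the standard form below, where $P_c$ and $G_c$ denote the sets of pixels predicted as, and annotated with, class $c$ (our notation, introduced for illustration):
\[
\mathrm{DICE}(P_c, G_c) = \frac{2\,|P_c \cap G_c|}{|P_c| + |G_c|}.
\]
How the excluded regions enter this computation is an assumption on our part: one natural reading is that pixels inside the artifact localization mask $M$ are removed from both sets before scoring, i.e., $P_c$ and $G_c$ are replaced by $P_c \setminus M$ and $G_c \setminus M$.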

\begin{table}[t]
 \centering
  {\caption{Downstream performance of the state-of-the-art segmentation model on BCSS for clean images, artifact images, and images restored with HARP.}}%
  \begin{adjustbox}{max width=0.8\linewidth}
  {\begin{tabular}{|l||c|c|c|c||c|}
  \hline
   \bfseries Metric:                & \multicolumn{5}{c|}{\bfseries DICE {\%}}\\\hline
   \bfseries on:                    & Tumor                 & Stroma                & Lymphocyte-rich           & Necrosis              & Average    \\\hline
    \bfseries Clean                 & $86.1\pm0.4$            & $83.8\pm0.7$            & $81.8\pm2.1$                & $74.2\pm2.8$            & $81.5\pm1.1$       \\\hline
    \bfseries Artifacts             & $77.7\pm0.3$            & $77.9\pm0.9$            & $76.2\pm2.3$                & $64.9\pm2.6$            & $74.2\pm1.0$       \\\hline
    \bfseries Artifacts w/o seg.    & $80.6\pm0.4$            & $81.1\pm0.8$            & $77.9\pm2.4$                & $68.6\pm2.9$            & $77.0\pm1.1$       \\\hline
    \bfseries HARP (Ours)           & $82.2\pm0.5$            & $82.0\pm0.9$            & $78.5\pm2.3$                & $69.3\pm2.8$            & $78.0\pm1.0$       \\\hline
  \end{tabular}}
  \end{adjustbox}
  \label{tab:ArtifactDownstream}
\end{table}

\textbf{User Study:}
Finally, we conducted a user study with four pathologists on 50 image pairs as a visual Turing test of the produced image quality. One image of each pair is a normal image from the training distribution, and the other is an image from the test distribution augmented with an artifact and then processed with HARP. We use 5 artifact images for each of the 10 artifact types. The pathologists were instructed to view the $256\times256$ images at $100\%$ scale and not to zoom, so that interpolation artifacts from the preprocessing of the images would not affect the study. The study was timed to ensure that participants followed a standard clinical workflow. We calculate the Matthews correlation coefficient (MCC) for each participant and report the number of falsely classified samples. The participants achieved the following scores: -0.071 MCC (27/50), -0.159 MCC (30/50), 0.239 MCC (20/50), and 0.296 MCC (18/50) with the times 7:34 min, 5:50 min, 4:45 min, and 7:00 min, respectively. All participants found it impossible to reliably distinguish real from restored images; our results suggest, at best, a weak positive correlation that may be due to chance. This further demonstrates the potential of HARP and the risks of not disclosing generated content in the clinical workflow. We give more results and a sample of five images from the study in the appendix. %We encourage the reader to try it out; the correct answers are given as a spoiler.
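For readers less familiar with the MCC, its standard definition for a binary decision is given below. Which image type counts as the positive class is an assumption here and does not affect the magnitude of the score; $\mathit{TP}$, $\mathit{TN}$, $\mathit{FP}$, and $\mathit{FN}$ denote the entries of a participant's confusion matrix.
\[
\mathrm{MCC} = \frac{\mathit{TP}\cdot\mathit{TN} - \mathit{FP}\cdot\mathit{FN}}
{\sqrt{(\mathit{TP}+\mathit{FP})(\mathit{TP}+\mathit{FN})(\mathit{TN}+\mathit{FP})(\mathit{TN}+\mathit{FN})}}.
\]
The MCC ranges from $-1$ to $1$, where $1$ indicates perfect agreement, $-1$ perfect disagreement, and $0$ performance no better than random guessing, so scores close to zero indicate that the restored images are hard to tell apart from real ones.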

\textbf{Dangers and Impact:}
Generative AI cannot go unlabeled, as the EU AI Act~\cite{eu_AIAct_2016} suggests. In pathology, generative AI risks misleading the diagnoses made by pathologists and other AI systems, as our user study showed. Therefore, we exclude the artifact localization masks in the downstream evaluation and recommend highlighting them. Nonetheless, computational pathology promises to save time and increase diagnostic accuracy for patients, and HARP serves as a supporting structure to ensure its reliability. Further, it has the added benefit of improving image quality without rescanning. %However, we encourage future research into this topic to improve the reliability of restoration and more efficient and broadly applicable tools to achieve a trustworthy AI that revolutionizes the daily clinical routine. %, and that’s why we make our code available upon acceptance.