
\section{Introduction}

The trend towards ever-more-complex machine learning models, such as deep neural networks, has led to improved accuracy at the cost of interpretability. Most state-of-the-art models are inherently opaque, making understanding their inner dynamics and decision-making processes difficult. The causality underlying the model's decision-making remains unclear, which has motivated the emerging research area of explainable AI \cite{das2020opportunities}.

Due to the intuitive nature by which visual heatmaps can communicate explanatory information, numerous such methods have been developed in the computer vision domain.
In this paper, we try to reproduce and extend the results of CartoonX (Cartoon Explanations)~\cite{cartoonX}.
CartoonX is a novel, model-agnostic explanation method for image classifiers.
What separates CartoonX from conventional explanatory methods is that it operates in the wavelet domain instead of the pixel domain.
This paradigm shift is motivated by the idea that demanding sparsity in the wavelet domain introduces piece-wise smooth explanations, i.e. asking the question \textit{What is the piece-wise smooth part of the input signal that leads to the model decision?}~\cite{cartoonX}.
Conventional methods (e.g. Smoothgrad~\cite{smilkov2017smoothgrad} or LIME~\cite{ribeiro2016lime}), which operate in the pixel domain, produce pixel-sparse or jittery explanations.
Piece-wise smooth explanations are comparatively easier to interpret for a human.

CartoonX generates explanations for a decision by applying the rate-distortion explanation (RDE) framework (originally proposed by~\cite{rdeOriginal} and extended by~\cite{rdeExtend}) in the wavelet domain of an image.
RDE is an optimization problem that enforces maximum sparsity with minimum distortion in the model's output.
This is achieved by optimizing a deletion mask applied to the pixel coefficients, thus marking relevant pixels.
In the conventional setting, Pixel RDE enforces sparsity on (super-)pixels.
In the case of CartoonX, RDE is used to enforce sparsity across wavelet coefficients.
Hence, the deletion mask is applied in the wavelet domain. 

\section{Scope of reproducibility}
\label{sec:claims}

In this reproducibility study, we examine the original authors' claims~\cite{cartoonX} that:
\begin{itemize}
\setlength\itemsep{-0.1em}
    \item CartoonX extracts relevant piece-wise smooth parts of the image, resulting in more straightforward explanations for humans.
    \item CartoonX achieves lower distortion in the model output while using fewer coefficients than other state-of-the-art methods.
    \item CartoonX is model-agnostic.
\end{itemize}

% Explain first two claims
The original authors have shown that CartoonX is agnostic to different CNNs.
We extend their experiments and investigate whether their method is also agnostic to Vision Transformers (ViT).
Using a ViT, the results of CartoonX can also be compared with the attention map extracted from the attention weights of the Transformer model.
Attention maps have been used in other works to explain model decisions~\cite{chefer2021transformer}.
Therefore, they serve as a sound basis for the cross-validation of the results generated by CartoonX. 

Furthermore, we demonstrate that improving CartoonX's runtime via enhanced initialization techniques of the deletion mask is difficult. 
Due to the computational cost associated with generating sufficient training data for a neural-network-based approach, two different heuristical strategies were assessed.
Our results indicate that simple feature engineering might be insufficient to achieve competitive runtimes.
As the original authors suggested before us, it may be necessary to use more complex methods such as a neural network to find a good initialization.
Such a network could be trained to suggest an initial deletion mask for arbitrary images of the target distribution.

% Add a section where we state the contributions of our paper. Thus Reproducing the results, initialization extension and vit Here or after scope of reproducibility?

% I think this part should probably be skipped. It is a nice addition, but no space

Summarizing, the main contributions of this publication are:
\begin{itemize}
\setlength\itemsep{0.1em}
    \item Reproducing the quantitative and qualitative experiments of the original paper and evaluating the first two claims made by the authors
    \item Examining the model agnosticism claim by running CartoonX with a ViT.
    \item Exploring the approach of using simple heuristics for initialization to improve runtime.
\end{itemize}