%\documentclass{uai2025} % for initial submission
\documentclass[accepted]{uai2025} % after acceptance, for a revised version; 
% also before submission to see how the non-anonymous paper would look like 
                        
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2025} % ptmx math instead of Computer
                                         % Modern (has noticeable issues)
% \documentclass[mathfont=newtx]{uai2025} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Our Custom
\usepackage{graphicx}

\usepackage{amsmath}
\usepackage{amsthm}
\usepackage{amssymb}

\usepackage[capitalize]{cleveref}
\usepackage{algorithm}
\usepackage{algorithmic}

\usepackage{multirow}
\usepackage{colortbl}

\usepackage{pifont}
\usepackage{xcolor}
\usepackage{xspace}

% ============ custom var ================
\newcommand{\R}{\mathbb{R}}
\newcommand{\mymethod}{Contrast-CAT\xspace}
\newcommand{\cmark}{\textcolor{black}{\ding{51}}}%
\newcommand{\xmark}{\textcolor{black}{\ding{55}}}% 55? 53?
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\title{Contrast-CAT: Contrasting Activations for Enhanced Interpretability in Transformer-based Text Classifiers}

% The standard author block has changed for UAI 2025 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
%\author[1]{\href{mailto:<jj@example.edu>?Subject=Your UAI 2025 paper}{Sungmin Han}{}}
\author[1]{Sungmin Han}
\author[1]{Jeonghyun Lee}
\author[1]{Sangkyun Lee\thanks{Corresponding author}}
% Add affiliations after the authors
\affil[1]{%
    School of Cybersecurity, Korea University, Seoul, South Korea
    \texttt{\{sungmin\_15,nomar0107,sangkyun\}@korea.ac.kr}
}

\begin{document}
\maketitle

\begin{abstract}
Transformers have profoundly influenced AI research, but explaining their decisions remains challenging -- even for relatively simple tasks such as classification -- which hinders trust and safe deployment in real-world applications. Although activation-based attribution methods effectively explain transformer-based text classification models, our findings reveal that these methods can be undermined by class-irrelevant features within activations, leading to less reliable interpretations. To address this limitation, we propose Contrast-CAT, a novel activation contrast-based attribution method that refines token-level attributions by filtering out class-irrelevant features. By contrasting the activations of an input sequence with reference activations, Contrast-CAT generates clearer and more faithful attribution maps. Experimental results across various datasets and models confirm that Contrast-CAT consistently outperforms state-of-the-art methods. Notably, under the MoRF setting, it achieves average improvements of $\times 1.30$ in AOPC and $\times 2.25$ in LOdds over the strongest competing methods, demonstrating its effectiveness in enhancing interpretability for transformer-based text classification.
\end{abstract}

\section{Introduction}\label{sec:intro}

\begin{figure}[tb]
\centering
\includegraphics[width=0.435\textwidth]{figs/motivate.pdf}
\caption{Heatmaps displaying attribution values from different encoder layers of the BERT$_{\text{base}}$ model for a negative review prediction. Panel A shows maps generated by AttCAT, which applies gradients directly to activations, while Panel B shows maps from \mymethod, which applies gradients to activation contrast information. Values closer to 1 (red) indicate stronger contributions to the negative prediction.}
\label{fig:motivation}
\end{figure}

Transformers~\citep{transformer} have achieved remarkable success in recent years, transcending both academic and industrial boundaries and becoming increasingly integrated into daily life. However, this widespread integration also heightens the risk of direct exposure to AI errors, underscoring the need to ensure the safety, security, and trustworthiness of AI systems through increased transparency~\citep{biden,NIST_Trustworthy_and_Responsible_AI_100_5,eu_ai_act_2024}. Consequently, developing methods for interpreting the decision-making processes of transformer-based models has become essential.

To address this need, numerous methods have been proposed for interpreting transformer-based models, particularly in text classification tasks where they have shown remarkable performance. These methods generate attribution maps that indicate the relative contributions of input tokens to a model's decisions. In Section~\ref{sect:related}, we categorize them into attention-based, LRP-based, and activation-based approaches. This work focuses on activation-based attribution, which leverages a model's activation information to produce attribution maps and has demonstrated state-of-the-art performance in attribution quality.

Activation-based attribution maps are typically derived by extracting activations from one or more layers of a neural network for a given input sequence. Then, the output gradient of the target class, with respect to these activations, is applied to isolate class-relevant features~\citep{GradCAM}. However, we find that this procedure can still be influenced by class-irrelevant signals within the activations, thus limiting its ability to produce accurate, class-specific interpretations.
For example, in \Cref{fig:motivation}, panel (A) illustrates attribution maps generated by AttCAT~\citep{Attcat}, one of the leading activation-based attribution methods, for the movie review `It is very slow.', which is classified as negative. Ideally, the word `slow' should register as highly relevant, with a positive attribution value in relation to the negative sentiment. However, AttCAT fails to detect this importance, whereas our proposed method, \mymethod, correctly assigns the highest attribution to `slow.'

In this paper, we introduce \mymethod, a novel activation-based attribution method for transformer-based text classification. We find that existing methods often incorporate class-irrelevant signals, compromising attribution accuracy. By contrasting target activations with multiple reference activations, \mymethod filters out these irrelevant features and produces high-quality token-level attribution maps. Extensive experiments show that \mymethod consistently outperforms state-of-the-art approaches, achieving average improvements of $\times 1.30$ and $\times 2.25$ in AOPC and LOdds under the MoRF setting, and $\times 1.34$ and $\times 1.03$ under the LeRF setting, compared to the best competitors.

\section{Related Work}\label{sect:related}

We describe attribution methods for interpreting transformer-based text classification models, categorizing them into attention-, LRP-, and activation-based approaches.

\paragraph{Attention-based Attribution}
Attention-based attribution methods rely on attention scores, a key component of transformers.
Under the assumption that input tokens with high attention scores significantly influence model outputs, numerous studies~\citep{attn_xai_2,attn_xai_5,Rollout,globenc,rollout2} have used attention scores to interpret model behavior.
Specifically, \citet{Rollout} proposed Rollout, which integrates attention scores across multiple layers while accounting for skip connections in transformer architectures to capture information flow.
Additionally, several studies~\citep{attn_grad_xai_2,Gradsam} incorporate the gradients of the attention weights into the interpretation.
Despite advances in attention-based methods, significant debate remains about whether attention scores truly reflect the relevance of input tokens to model predictions~\citep{attention_is_not_explain, attention_not_not_explain}.

\paragraph{LRP-based Attribution}
Layer-wise relevance propagation (LRP)~\citep{LRP_bach} is a technique for backpropagating relevance scores through a neural network, where each score reflects the contribution of a network component to the model's prediction.
Building on LRP, several studies have derived explanations for model behavior~\citep{CLRP,PartialLRP,Transatt}.
\citet{PartialLRP} applied LRP partially to identify the most important attention heads within a specific encoder layer of a transformer, using relevance scores for the attention weights.
\citet{Transatt} introduced TransAtt, which propagates relevance scores through all layers of a transformer, combining these scores with gradients of the attention weights and utilizing the Rollout technique for multi-layer integration.
However, LRP-based methods are limited by certain assumptions, known as the LRP rules, designed to uphold the principle of relevance conservation~\citep{lrp_rule}. 

\paragraph{Activation-based Attribution}

In contrast to the methods discussed above, activation-based attribution primarily relies on activation information from each layer of a transformer model. These methods are based on core ideas originally developed for convolutional neural networks (CNNs), which have been shown to be effective for generating high-quality interpretations with simple implementations and broad versatility~\citep{GradCAM,ScoreCAM,activation_noise0,LibraCAM}.
\citet{Attcat} introduced AttCAT as the first adaptation of Grad-CAM~\citep{GradCAM}, one of the most popular activation-based methods for CNNs, to interpret the decisions of transformer-based text classification models.
AttCAT generates token-level attribution maps by combining activations with their gradients with respect to the model's predictions, following Grad-CAM's core idea of using gradients to capture class-relevant information.
Similarly, \citet{TIS} introduced TIS, an adaptation of Score-CAM~\citep{ScoreCAM}: TIS uses the centroids of activation clusters identified from the activations of all layers to compute relevance scores in a manner akin to Score-CAM.

Although there are attribution methods for transformer-based text classification models that use gradients to extract class-relevant features from activations, no approach has yet focused on filtering out class-irrelevant features through activation contrasting to improve token-level attribution quality.


\section{Preliminaries}

\paragraph{Problem Statement}

Consider a pre-trained transformer-based model as a function $f$ processing input tokens $x := \{x_{i}\}_{i=1}^{T}$, where $T$ is the length of the input sequence, and each token is denoted as $x_{i} \in \R^{n}$. 
Our objective is to generate a token-level attribution map $I(x):= \{I(x)_{i}\}_{i=1}^{T}$, where $I(x)_{i}$ represents the relevance score of each input token $x_{i}$ regarding the output $f(x)$.

\paragraph{Transformers}

Let us consider a transformer-based model which is composed of $L$ stacked layers of identical structure. 
The $\ell$-th layer outputs an activation sequence $A^{\ell}:= \{A^{\ell}_{i}\}_{i=1}^{T}$ corresponding to the input tokens, where $A^{\ell}_{i} \in \R^{n}$.
Each layer computes its output by combining the output from the attention layer with the previous layer's activation, where the attention layer calculates the attention scores:
\begin{equation}\label{eq:transformer.attention}
\begin{aligned}
     \alpha^{\ell,h} := \text{softmax}\left( Q^{\ell,h}(A^{\ell-1}) \cdot K^{\ell,h}(A^{\ell-1})^{\top}/\sqrt{d} \right).
\end{aligned}
\end{equation}
Here, $Q^{\ell,h}(\cdot)$, $K^{\ell,h}(\cdot)$, and $V^{\ell,h}(\cdot)$ are the transformations for computing the query, key, and value of the $\ell$-th layer's $h$-th head, respectively, and $d$ is a scaling factor.
$\alpha^{\ell,h} \in \R^{T \times T}$ is the attention map of the $h$-th head, which contains the attention scores, for $h = 1, \dots, H$.
We denote by $\tilde A^{\ell,h}$ the output of the $h$-th attention head in the $\ell$-th layer:
\begin{equation*}
\tilde A^{\ell,h} := \alpha^{\ell,h} \cdot V^{\ell,h}(A^{\ell-1}).
\end{equation*}
The outputs from multiple attention heads are concatenated and then combined using a fully connected layer with the skip connection: $\hat A^{\ell} := \text{Concat}(\tilde A^{\ell,1},\tilde A^{\ell,2}, \dots, \tilde A^{\ell,H}) \cdot \tilde W^{\ell} + A^{\ell-1},$
where $\tilde W^{\ell}$ is the weight of the fully connected layer.
Finally, the $\ell$-th layer's output $A^{\ell} \in \R^{T \times n}$ is computed using a feed-forward layer and skip connection:
\begin{equation}\label{eq:transformer.active}
\begin{aligned}
    & A^{\ell} =  \hat A^{\ell} \cdot W^{\ell} + \hat A^{\ell},
\end{aligned}
\end{equation}
where $W^{\ell} \in \R^{n \times n}$ is the weight for the feed-forward layer. 
We have omitted bias parameters and layer normalization in the above expressions for simplicity.

\section{Methodology}\label{sec:mymethod}

\begin{figure*}[tb]
    \centering
    \includegraphics[width=0.89\textwidth]{figs/overview.pdf}
    \caption{Construction of \mymethod's attribution map. For an input token sequence $x$, \mymethod\ computes an attribution map $I_{R}(x)$ by contrasting the \emph{target activation} $A$ (black) with a \emph{reference activation} $R$ (blue), then weighting by gradients (red) and attention (yellow).}
    \label{fig:overview}
\end{figure*}

We introduce \mymethod, a \emph{token-level}, \emph{activation-based} attribution framework tailored to \emph{transformer} models.

\subsection{Attribution Map}\label{sec:attribution-map}

Let $x := \{x_i\}_{i=1}^T$ be a sequence of $T$ tokens, and let $f_c(x)$ denote the model's score for the target class $c$. For each token $x_i$ ($i=1,\dots,T$), \mymethod\ defines its \emph{attribution} with respect to a \emph{contrastive reference} $R$ as:
\begin{equation}
\label{eq:contrastcat}
I_{R}(x)_{i}
\;:=\;
\sum_{\ell=1}^{L}
\hat{\alpha}^{\ell}_{i}
\sum_{j=1}^{n}
\Bigl(
  \tfrac{\partial f_c(x)}{\partial A^\ell_{i}}
  \;\odot\;
  \bigl(A^\ell_{i} - R^\ell_{i}\bigr)
\Bigr)_{j}.
\end{equation}
Here,
\begin{itemize}
   \item $A^\ell_{i} \in \mathbb{R}^n$ is the activation for token $x_i$ at layer~$\ell$,
   \item $\tfrac{\partial f_c(x)}{\partial A^\ell_{i}} \in \mathbb{R}^n$ is the gradient of $f_c(x)$ w.r.t. $A^\ell_{i}$,
   \item $R^\ell_{i}$ is a \emph{reference activation} for token $i$ chosen from a reference token sequence $r$ such that $f_c(r)<\gamma$ for a small threshold $\gamma > 0$,
   \item $\odot$ denotes element-wise multiplication,
   \item $\hat{\alpha}^\ell_i$ is the \emph{averaged attention} of token~$i$ at layer~$\ell$.
\end{itemize}
In essence, $\bigl(A^\ell_i - R^\ell_i\bigr)$ \emph{contrasts} the target activation against one that does not strongly activate class~$c$, thereby removing non-target signals (class-irrelevant features). The factor $\tfrac{\partial f_c(x)}{\partial A^\ell_{i}}$ highlights the parts of the activation that actually affect the model's output, while $\hat{\alpha}^\ell_i$ weights these elements by how much the transformer attends to token~$i$.
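
To make the role of the contrast term concrete, note that the inner sum in \cref{eq:contrastcat} is an inner product between the gradient and the activation difference. The following toy calculation is only an illustration with made-up values for $n = 3$ (they are not taken from any model): let $g := \tfrac{\partial f_c(x)}{\partial A^\ell_{i}} = (0.5, -1, 2)$, $A^\ell_i = (1, 0, 3)$, and $R^\ell_i = (1, 2, 2.5)$. Then
\begin{equation*}
\begin{aligned}
\sum_{j=1}^{3} \bigl(g \odot (A^\ell_i - R^\ell_i)\bigr)_{j} &= 0.5 \cdot 0 + (-1)\cdot(-2) + 2 \cdot 0.5 = 3, \\
\sum_{j=1}^{3} \bigl(g \odot A^\ell_i\bigr)_{j} &= 0.5 \cdot 1 + (-1)\cdot 0 + 2 \cdot 3 = 6.5.
\end{aligned}
\end{equation*}
The first coordinate, on which the target and the reference agree, contributes nothing after contrasting, whereas it would contribute $0.5$ if the gradient were applied to the raw activation.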

Figure~\ref{fig:overview} provides a simplified illustration of the attribution map construction process for \mymethod.

\subsection{Component Details and Motivation}
\label{sec:component_details}

\paragraph{Token-Level Activations $\boldsymbol{A^\ell_i}$.}
Transformers represent each token $x_i$ as a vector in each layer $\ell$. By working at the \emph{token level}, \mymethod\ directly captures the discrete, context-dependent nature of language, which differentiates it from CNN-based attribution methods initially designed for spatial feature maps.

\paragraph{Gradients $\boldsymbol{\tfrac{\partial f_c(x)}{\partial A^\ell_i}}$.}
Inspired by gradient-based interpretations, we leverage the partial derivative of $f_c(x)$ w.r.t. $A^\ell_i$. This follows general insights from activation-based methods~\citep[e.g.,][]{GradCAM}, so that the components of $A^\ell_i$ that genuinely influence $f_c(x)$ are emphasized.

\paragraph{Activation Contrasting $\boldsymbol{A^\ell_i - R^\ell_i}$.}
A key novelty of \mymethod\ is its \emph{contrast} operation, which computes the difference between a target activation $A^\ell_i$ and a \emph{low-activation} reference $R^\ell_i$. The reference $R^\ell_i$ is chosen from a sequence $r$ such that $f_c(r)<\gamma$, where $\gamma$ is a pre-defined small positive number ($\gamma >0$).
This choice ensures that the reference activation has a minimal response to the target class $c$ (we set $\gamma=10^{-3}$ in our experiments).
While the use of reference or baseline activations is broadly motivated by prior work~\citep[e.g.,][]{LibraCAM}, \mymethod\ is the first to extend this idea to transformer-based text classification networks, applying it \emph{across multiple transformer layers} and at the \emph{token level}, explicitly targeting textual data. This operation highlights class-specific features that distinguish $x$ from a weakly activating example.

\paragraph{Attention Weights $\boldsymbol{\hat{\alpha}^\ell_i}$.}
Transformers distribute relevance across tokens via multi-head attention. We aggregate these attention scores into $\hat{\alpha}^\ell_i$, giving higher importance to tokens that the model itself regards as salient. Unlike purely attention-based methods~\citep[e.g.,][]{Rollout}, \mymethod\ integrates attention and gradient-based cues, offering a more robust attribution signal.

\paragraph{Multi-Layer Attribution}
Building on prior findings that transformers encode varying levels of semantic information across their layers, ranging from phrase-level details to deeper semantic features~\citep{level_of_semantics3,level_of_semantics1,emnlp_review_telling_bert}, we diverge from traditional activation-based attribution methods, which typically rely on a single layer~\citep[e.g.,][]{Gradsam}. Instead, we incorporate \emph{multi-layer} activations $A^\ell$ from all layers $\ell = 1, \dots, L$ in \cref{eq:transformer.active}, together with their layer-wise attention scores $\alpha^{\ell,h}$ in \cref{eq:transformer.attention}. This design captures \emph{layer-specific} token semantics and, by weighting them with $\hat{\alpha}^\ell_i$, highlights the tokens most influential to the model's output across all layers.

\subsection{Attribution with Multiple Contrast}
\label{sec:multi-references}

Relying on a \emph{single} reference from one class can be insufficient if the target activations $A^\ell := \{A^{\ell}_{i}\}_{i=1}^{T}$ encode features shared across \emph{multiple} non-target classes. Moreover, any features that consistently remain after contrasting $A^\ell$ with several reference activations are more likely to represent class-specific properties. To address this, we generate a collection of attribution maps 
\begin{align*}
    D \;:=\;
  \bigl\{
    I_{R(r)}(x)
    \;\bigm|\;
    r \in \text{training set},\;
    f_{c}(r) < \gamma
  \bigr\},
\end{align*}
by repeating the procedure in Section~\ref{sec:attribution-map} with \emph{multiple} reference sequences. We cache these reference activations, which together form a \emph{reference library}, for use during inference. In practice, we employ $30$ pre-computed references per class.

\paragraph{Refinement via Deletion Test}
\label{method.selectively_fitering_multicont}
Although this multi-reference approach reduces the risk of overlooking crucial class-relevant features, not all resulting maps $I_{R(r)}(x)$ are guaranteed to be reliable. We therefore \emph{refine} \mymethod by examining each map's \emph{attribution quality} using a token-wise deletion test~\citep[e.g.,][]{deletion,ScoreCAM}. Specifically, we remove the top-attributed tokens one by one and record how much the model's predictive probability for class~$c$ decreases. The resulting \emph{average probability drop score}, denoted by $S\bigl(I_{R}(x)\bigr)$, captures, on a token-by-token basis, how effectively a map localizes truly important tokens.

Any map with a drop score below a specified threshold $\rho$ (set in our experiments to the mean plus one standard deviation of all drop scores) is discarded. Finally, we generate the \mymethod\ attribution by averaging all remaining high-quality maps:
\begin{align*}
  &I(x) 
  \;:=\; 
  \frac{1}{|M|}
  \sum_{I_{R}(x)\,\in\,M}
  I_{R}(x),
  \\
  &\text{where}
  \quad
  M := \bigl\{\, I_{R}(x) \in D : S\bigl(I_{R}(x)\bigr) \ge \rho \bigr\}.
\end{align*}
This final aggregation fuses the most credible contrastive perspectives into a single, robust token-level attribution.  
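
As a purely illustrative example of the thresholding step (the scores below are hypothetical, and we assume the population standard deviation, which is not fixed by the description above), suppose $D$ contains four maps whose average probability drop scores are $0.2$, $0.4$, $0.6$, and $0.8$. Then
\begin{equation*}
\rho
= \underbrace{0.5}_{\text{mean}}
+ \underbrace{\sqrt{\tfrac{1}{4}\bigl(0.09 + 0.01 + 0.01 + 0.09\bigr)}}_{\text{standard deviation}}
\approx 0.5 + 0.224 = 0.724,
\end{equation*}
so only the map with score $0.8$ satisfies $S\bigl(I_{R}(x)\bigr) \ge \rho$, and $I(x)$ reduces to that single map.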

\section{Experiments}

\paragraph{Experiment Settings}
We implemented our method, Contrast-CAT, using PyTorch (the code is available at \url{https://github.com/ku-air/Contrast-CAT}).
We used the BERT$_{\text{base}}$ model~\citep{bert}, consisting of $12$ encoder layers with $12$ attention heads, as the transformer-based model for our experiments (see the supplementary material for results using other transformer-based models). We evaluated our method on four popular datasets for text classification tasks: Amazon Polarity~\citep{amazon_yelp}, Yelp Polarity~\citep{amazon_yelp}, SST2~\citep{sst}, and IMDB~\citep{imdb}.
We report results on $2000$ randomly selected samples from the test set of each dataset, except for SST2, for which the entire set was used because it contains fewer than $2000$ samples.

We compared our method to a range of attribution methods from three categories. Attention-based: RawAtt, Rollout~\citep{Rollout}, Att-grads, Att$\times$Att-grads, and Grad-SAM~\citep{Gradsam}; LRP-based: Full LRP~\citep{FLRP}, Partial LRP~\citep{PartialLRP}, and TransAtt~\citep{Transatt}; and activation-based: CAT, AttCAT~\citep{Attcat}, and TIS~\citep{TIS}.

\paragraph{Evaluation Metrics}

We used the area over the perturbation curve (AOPC)~\citep{aopc1,aopc2_lodss2} and the log-odds (LOdds)~\citep{lodss,aopc2_lodss2} metrics to assess the faithfulness of attributions, following prior work~\citep{Attcat}. Faithfulness refers to how accurately an attribution map's scores reflect the actual influence of each input token on the model's prediction. The AOPC and LOdds metrics are defined as follows: (1) AOPC($k$) := $\frac{1}{N}\sum_{i=1}^{N} (y_i^c - \tilde y_i^c)$, and (2) LOdds($k$) := $\frac{1}{N}\sum_{i=1}^{N} \log\left(\frac{\tilde y_i^c}{y_i^c}\right)$.
Here, $N$ is the total number of data points used for evaluation, $y_i^c$ denotes the model's prediction probability for class $c$ on the $i$-th input token sequence, and $\tilde y_i^c$ denotes the corresponding probability after removing the top-$k\%$ of input tokens ranked by the relevance scores of an attribution map.
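
For intuition, consider a small example with hypothetical values (not drawn from our experiments) and the natural logarithm assumed: $N = 2$, with probabilities before and after removal of $(0.9, 0.3)$ for the first sample and $(0.8, 0.4)$ for the second. The definitions above then give
\begin{equation*}
\begin{aligned}
\text{AOPC}(k) &= \tfrac{1}{2}\bigl[(0.9 - 0.3) + (0.8 - 0.4)\bigr] = 0.5, \\
\text{LOdds}(k) &= \tfrac{1}{2}\Bigl[\log\tfrac{0.3}{0.9} + \log\tfrac{0.4}{0.8}\Bigr] = -\tfrac{1}{2}\log 6 \approx -0.896.
\end{aligned}
\end{equation*}
A larger drop in probability after removal thus raises AOPC and lowers LOdds.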

To evaluate attribution quality more precisely using the AOPC and LOdds metrics while addressing inconsistencies from token removal order (i.e., removing the most relevant tokens first versus the least relevant tokens first)~\citep{morf_lerf_consistency}, we conducted experiments under two settings: one where tokens were removed in descending order of relevance scores (MoRF: Most Relevant First), and another in ascending order (LeRF: Least Relevant First).
Consistently achieving high-quality attribution under both conditions indicates superior attribution quality. 
Specifically, under the MoRF setting, higher AOPC and lower LOdds indicate better attribution, while under the LeRF setting, lower AOPC and higher LOdds suggest better performance.


\subsection{Faithfulness of Attribution}\label{exp.faithful}


\begin{figure*}[tb]
\centering
\includegraphics[width=0.93\textwidth]{figs/plot_MoRF.pdf}
\caption{Quantitative comparison of the faithfulness of \mymethod and other attribution methods under the MoRF (Most Relevant First) setting. Arrows indicate the preferred direction: $\uparrow$ means higher is better, and $\downarrow$ means lower is better.}
\label{fig:plot.quantitative}
\end{figure*}

\begin{table*}[tb]
\centering
\caption{AUC values from the faithfulness evaluation, with (A) showing results under the MoRF (Most Relevant First) setting and (B) showing results under the LeRF (Least Relevant First) setting. The best and second-best results are highlighted in bold and underlined, respectively. Arrows indicate the preferred direction: $\uparrow$ means higher is better, and $\downarrow$ means lower is better.}\label{tbl:all.measures}

\newcolumntype{C}[1]{>{\centering\arraybackslash}m{#1}}
\begin{tabular}{c|c|c|c|c|c|c|c|c}\toprule
\multicolumn{9}{c}{(A) MoRF (Most Relevant First)} \\\midrule
 Dataset& \multicolumn{2}{c|}{Amazon} & \multicolumn{2}{c|}{Yelp} & \multicolumn{2}{c|}{SST2}& \multicolumn{2}{c}{IMDB} \\\midrule%\cline{2-7}
 Method & AOPC$\uparrow$ & LOdds$\downarrow$ & AOPC$\uparrow$ & LOdds$\downarrow$ & AOPC$\uparrow$ & LOdds$\downarrow$ & AOPC$\uparrow$ & LOdds$\downarrow$ \\
\midrule

RawAtt & 0.424 & 0.405 & 0.412 & 0.462 & 0.386 & 0.471 & 0.335 & 0.564 \\ 
Rollout & 0.327 & 0.516 & 0.282 & 0.601 & 0.329 & 0.558 & 0.339 & 0.566 \\
Att-grads & 0.061 & 0.749 & 0.059 & 0.754 & 0.132 & 0.691 & 0.061 & 0.759 \\
Att$\times$Att-grads & 0.054 & 0.756 & 0.045 & 0.763 & 0.109 & 0.711 & 0.075 & 0.746 \\
Grad-SAM & 0.312 & 0.526 & 0.235 & 0.633 & 0.356 & 0.518 & 0.266 & 0.623 \\
Full LRP & 0.242 & 0.592 & 0.190 & 0.652 & 0.310 & 0.538 & 0.233 & 0.631 \\
Partial LRP & 0.463 & 0.356 & 0.447 & 0.422 & 0.400 & 0.461 & 0.364 & 0.538 \\
TransAtt & 0.461 & 0.366 & 0.473 & 0.404 & 0.432 & 0.428 & 0.458 & 0.455 \\
CAT & 0.482 & 0.341 & 0.440 & 0.383 & 0.452 & 0.382 & 0.632 & 0.215 \\
AttCAT & 0.527 & 0.292 & 0.470 & \underline{0.346} & 0.461 & 0.372 & \underline{0.644} & \underline{0.198} \\
TIS & \underline{0.560} & \underline{0.241} & \underline{0.494} & 0.349 & \underline{0.463} & \underline{0.367} & 0.618 & 0.277 \\
\mymethod & \textbf{0.703} & \textbf{0.117} & \textbf{0.687} & \textbf{0.131} & \textbf{0.654} & \textbf{0.157} & \textbf{0.738} & \textbf{0.101}
\\\toprule
\multicolumn{9}{c}{(B) LeRF (Least Relevant First)} \\\midrule
Dataset& \multicolumn{2}{c|}{Amazon} & \multicolumn{2}{c|}{Yelp} & \multicolumn{2}{c|}{SST2}& \multicolumn{2}{c}{IMDB} \\\midrule%\cline{2-7}
 Method & AOPC$\downarrow$ & LOdds$\uparrow$ & AOPC$\downarrow$ & LOdds$\uparrow$ & AOPC$\downarrow$ & LOdds$\uparrow$ & AOPC$\downarrow$ & LOdds$\uparrow$ \\
\midrule

RawAtt & 0.133 & 0.694 & 0.093 & 0.723 & 0.249 & 0.577 & 0.158 & 0.688 \\ 
Rollout & 0.166 & 0.670 & 0.130 & 0.687 & 0.373 & 0.448 & 0.126 & 0.711 \\
Att-grads & 0.636 & 0.186 & 0.560 & 0.252 & 0.601 & 0.223 & 0.588 & 0.271 \\
Att$\times$Att-grads & 0.707 & 0.111 & 0.660 & 0.145 & 0.681 & 0.126 & 0.709 & 0.127 \\
Grad-SAM & 0.139 & 0.677 & 0.107 & 0.713 & 0.285 & 0.547 & 0.118 & 0.715 \\
Full LRP & 0.254 & 0.588 & 0.187 & 0.649 & 0.377 & 0.454 & 0.199 & 0.656 \\
Partial LRP & 0.122 & 0.700 & 0.088 & 0.725 & 0.237 & 0.585 & 0.134 & 0.701 \\
TransAtt & 0.089 & 0.731 & \underline{0.063} & \underline{0.751} & 0.215 & 0.605 & \underline{0.061} & \underline{0.761} \\
CAT & 0.108 & 0.712 & 0.087 & 0.727 & 0.213 & 0.611 & 0.128 & 0.697 \\
AttCAT & \underline{0.078} & \underline{0.740} & \underline{0.063} & 0.747 & \underline{0.205} & \underline{0.623} & 0.119 & 0.703 \\
TIS & 0.104 & 0.719 & 0.082 & 0.737 & 0.252 & 0.562 & 0.135 & 0.691 \\
\mymethod & \textbf{0.058} & \textbf{0.757} & \textbf{0.048} & \textbf{0.759} & \textbf{0.147} & \textbf{0.669} & \textbf{0.047} & \textbf{0.775}
\\\bottomrule
\end{tabular}
\end{table*}

Figure~\ref{fig:plot.quantitative} illustrates the AOPC and LOdds values for attribution maps generated by each competing method, evaluated at top-$k\%$ thresholds with $k$ increased in steps of $10$ over the range $[10,90]$. Table~\ref{tbl:all.measures} provides the corresponding AUC values.
Note that Figure~\ref{fig:plot.quantitative} presents results for the MoRF setting only, while Table~\ref{tbl:all.measures} includes results for both MoRF and LeRF settings.
This evaluation characterizes the overall quality of an attribution map across threshold levels, rather than at a single value of $k$.

The trends in Figure~\ref{fig:plot.quantitative} reveal that our method, \mymethod, consistently maintains faithful attribution quality across all threshold levels and datasets compared to other methods. Table~\ref{tbl:all.measures} further supports this, showing that \mymethod achieves the best attribution quality under both MoRF and LeRF settings. Specifically, compared to the second-best results, \mymethod improves the AUC values of AOPC and LOdds under the MoRF setting by average factors of $\times 1.30$ and $\times 2.25$, respectively. Under the LeRF setting, the corresponding average improvements are $\times 1.34$ and $\times 1.03$, respectively.
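
The MoRF factors can be checked against Table~\ref{tbl:all.measures}. Assuming that the average is the arithmetic mean, over the four datasets, of the ratio between the value of \mymethod and the second-best value (inverted for LOdds, where lower is better), we obtain
\begin{align*}
\text{AOPC:}\quad
& \tfrac{1}{4}\Bigl(\tfrac{0.703}{0.560} + \tfrac{0.687}{0.494} + \tfrac{0.654}{0.463} + \tfrac{0.738}{0.644}\Bigr) \\
& \approx \tfrac{1}{4}\bigl(1.255 + 1.391 + 1.413 + 1.146\bigr) \approx 1.30, \\
\text{LOdds:}\quad
& \tfrac{1}{4}\Bigl(\tfrac{0.241}{0.117} + \tfrac{0.346}{0.131} + \tfrac{0.367}{0.157} + \tfrac{0.198}{0.101}\Bigr) \\
& \approx \tfrac{1}{4}\bigl(2.060 + 2.641 + 2.338 + 1.960\bigr) \approx 2.25.
\end{align*}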

\subsection{Qualitative Evaluation}
\begin{figure*}[tb]
\centering
\includegraphics[width=0.93\textwidth]{figs/qualitative.pdf}
\caption{Qualitative comparison of attribution quality. Relevance scores are shown with color shades: red for the highest importance, followed by orange.}
\label{fig:quality}
\end{figure*}

Figure~\ref{fig:quality} illustrates the attribution maps generated by \mymethod, TIS, and AttCAT, the top-$3$ ranked methods in our faithfulness evaluation, conducted under the MoRF setting (Table~\ref{tbl:all.measures}, (A) MoRF).
The examples provided are from the SST2 dataset. For ease of interpretation, only tokens with relevance scores exceeding $0.5$ are highlighted. 
As shown on the left side of Figure~\ref{fig:quality}, \mymethod identifies tokens relevant to the predicted class, such as `fails' or `disappointment' in the negative prediction cases.
For a positive prediction, in the input phrase `rare birds have more than enough charm to make it memorable.', \mymethod highlights `enough' and `charm' as the most relevant tokens, with `than', `make', `more', and `memorable' following in relevance. In contrast, AttCAT focuses only on `enough' and `memorable', missing `charm' and `more', while TIS identifies `to' as the most relevant token.


\subsection{The Effect of Activation Contrast}

\begin{table*}[tb]
\centering
\caption{The effect of our activation contrasting approach, measured under the MoRF (Most Relevant First) setting. `Random' uses randomly selected references (mean values over $30$ repetitions are reported), `Same' uses references from the same class as the target, and `Contrasting' refers to the proposed \mymethod. The best results are in boldface.}\label{tbl:ablation.denoise_effect}
\newcolumntype{C}[1]{>{\centering\arraybackslash}m{#1}}
\begin{tabular}{c|c|c|c|c|c|c|c|c}\toprule
 Dataset& \multicolumn{2}{c|}{Amazon} & \multicolumn{2}{c|}{Yelp} & \multicolumn{2}{c|}{SST2}& \multicolumn{2}{c}{IMDB} \\\midrule%\cline{2-7}
 Reference & AOPC$\uparrow$ & LOdds$\downarrow$ & AOPC$\uparrow$ & LOdds$\downarrow$ & AOPC$\uparrow$ & LOdds$\downarrow$ & AOPC$\uparrow$ & LOdds$\downarrow$ \\
\midrule

Random & 0.513 & 0.306 & 0.496 & 0.323 & 0.433 & 0.398 & 0.634 & 0.213 \\
Same & 0.144 & 0.667 & 0.159 & 0.650 & 0.089 & 0.728 & 0.124 & 0.614 \\
Contrasting & \textbf{0.703} & \textbf{0.117} & \textbf{0.687} & \textbf{0.131} & \textbf{0.654} & \textbf{0.157} & \textbf{0.738} & \textbf{0.101} \\\bottomrule
\end{tabular}
\end{table*}

To evaluate the effect of activation contrasting in \mymethod, we compared two variants against the full method: the `Random' variant draws references at random from the corresponding training dataset instead of selecting them as outlined in Section~\ref{sec:multi-references}, and the `Same' variant uses references from the same class as the target instead of from different classes.
The `Same' variant is thus the opposite of our method, which uses activations from different classes as contrastive references.

Table~\ref{tbl:ablation.denoise_effect} presents the AUC values for each variant of \mymethod, where the proposed \mymethod is denoted by `Contrasting'.
On all four datasets, attribution quality is worst with `Same', intermediate with `Random', and best with `Contrasting', which indicates that the proposed activation contrasting effectively reduces non-target signals in the activations, thereby helping to generate high-quality attribution maps.
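
As a reading aid, the unweighted means of the AOPC entries in Table~\ref{tbl:ablation.denoise_effect} over the four datasets can be worked out directly; this is simple arithmetic on the reported values, not an additional measurement:
\begin{align*}
\overline{\mathrm{AOPC}}_{\text{Contrasting}} &= \tfrac{1}{4}(0.703+0.687+0.654+0.738) \approx 0.696,\\
\overline{\mathrm{AOPC}}_{\text{Random}} &= \tfrac{1}{4}(0.513+0.496+0.433+0.634) = 0.519,\\
\overline{\mathrm{AOPC}}_{\text{Same}} &= \tfrac{1}{4}(0.144+0.159+0.089+0.124) = 0.129.
\end{align*}
The same computation for LOdds gives approximately $0.127$, $0.310$, and $0.665$, respectively, so the ordering of the three variants is identical under both metrics.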


\subsection{Confidence of Attribution}

\begin{table}[tb]
\centering
\caption{The results of the confidence evaluation, showing averaged rank correlation values. The values below $0.05$ (marked in gray) indicate that attributions tend to be class-distinct, as desired.}\label{tbl:all.k_tau}
\setlength{\tabcolsep}{2pt}
\begin{tabular}{c|c|c|c|c}\toprule
 \multirow{2}{*}{Method} & \multicolumn{4}{c}{Dataset} \\\cline{2-5}
  & Amazon & Yelp & SST2 & IMDB \\
\midrule

RawAtt & 1.00 & 1.00 & 1.00 & 1.00 \\ 
Rollout & 1.00 & 1.00 & 1.00 & 1.00 \\ 
Att-grads & \cellcolor[gray]{0.8}$<$ 0.05 & \cellcolor[gray]{0.8}$<$ 0.05 & \cellcolor[gray]{0.8}$<$ 0.05 & \cellcolor[gray]{0.8}$<$ 0.05 \\ 
Att$\times$Att-grads & \cellcolor[gray]{0.8}$<$ 0.05 & \cellcolor[gray]{0.8}$<$ 0.05 & \cellcolor[gray]{0.8}$<$ 0.05 & \cellcolor[gray]{0.8}$<$ 0.05 \\ 
Grad-SAM & 0.158 & 0.138 & 0.282 & 0.084 \\ 
Full LRP & 0.732 & 0.629 & 0.712 & 0.533 \\ 
Partial LRP & 0.952 & 0.924 & 0.957 & 0.859 \\ 
TransAtt & 0.153 & 0.135 & 0.342 & 0.061 \\ 
CAT & \cellcolor[gray]{0.8}$<$ 0.05 & \cellcolor[gray]{0.8}$<$ 0.05 & \cellcolor[gray]{0.8}$<$ 0.05 & \cellcolor[gray]{0.8}$<$ 0.05 \\ 
AttCAT & \cellcolor[gray]{0.8}$<$ 0.05 & \cellcolor[gray]{0.8}$<$ 0.05 & \cellcolor[gray]{0.8}$<$ 0.05 & \cellcolor[gray]{0.8}$<$ 0.05 \\ 
TIS & \cellcolor[gray]{0.8}$<$ 0.05 & \cellcolor[gray]{0.8}$<$ 0.05 & \cellcolor[gray]{0.8}$<$ 0.05 & \cellcolor[gray]{0.8}$<$ 0.05 \\ 
\mymethod & \cellcolor[gray]{0.8}$<$ 0.05 & \cellcolor[gray]{0.8}$<$ 0.05 & \cellcolor[gray]{0.8}$<$ 0.05 & \cellcolor[gray]{0.8}$<$ 0.05 \\ 

\bottomrule
\end{tabular}
\end{table}

If an attribution method consistently generates similar attribution maps regardless of the model's prediction, its confidence is questionable. We therefore evaluated the confidence of the attribution methods using the Kendall-$\tau$ rank correlation~\citep{kendall1948rank}, a statistical measure of the similarity between two rankings based on the relative order of their values. We compute an averaged rank correlation:
\begin{equation*}
\frac{1}{N}\sum_{i=1}^{N} \text{Kendall-}\tau(P^{c}_{i}, P^{\hat c}_{i}),
\end{equation*}
where $P^{c}_{i}$ is an array of token indices in descending order of relevance scores for class $c$ in an attribution map, $P^{\hat c}_{i}$ is a similar array but for the class $\hat c \neq c$, and $N$ is the total number of data points used for testing.
For the choice of $\hat c$, we followed the settings of AttCAT as detailed in their open-source implementation, where the class immediately following the class $c$ was chosen.
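
For concreteness, and assuming the tie-free form of the coefficient (the description above does not fix a particular variant), the rank correlation between two rankings of the same $n$ tokens can be written as
\begin{equation*}
\tau = \frac{n_{\text{con}} - n_{\text{dis}}}{\binom{n}{2}},
\end{equation*}
where $n_{\text{con}}$ and $n_{\text{dis}}$ are the numbers of token pairs ordered the same way and in opposite ways by the two rankings; both symbols are introduced here only for illustration. For example, if three tokens receive ranks $(1,2,3)$ for class $c$ and $(1,3,2)$ for class $\hat c$, two of the three pairs are concordant and one is discordant, so $\tau = (2-1)/3 = 1/3$. Identical rankings give $\tau = 1$ and fully reversed rankings give $\tau = -1$.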

If an attribution method assigns relevance scores to tokens in distinct orders for different class predictions of the inspected model, the rank correlation is expected to be low.
Table~\ref{tbl:all.k_tau} presents the average rank correlation of each attribution method on each dataset. Cases with an average rank correlation below $0.05$ are marked as `$<0.05$' and highlighted; in these cases the attribution method appears to behave soundly. \mymethod passes the test, along with Att-grads, Att$\times$Att-grads, CAT, AttCAT, and TIS.
In contrast, RawAtt and Rollout yield a correlation of $1.00$ on every dataset, and Partial LRP stays between $0.859$ and $0.957$, suggesting that these methods struggle to produce distinct attributions for different class outcomes.

\subsection{The Effect of Using Multiple Layers}

\begin{figure*}[tb]
\centering
\includegraphics[width=0.93\textwidth]{figs/multiple.pdf}
\caption{Comparison of \mymethod's attribution quality measured under the MoRF (Most Relevant First) setting: (A) varying the number of layers from penultimate to all, and (B) varying the number of reference samples from $0$ to $30$.}
\label{fig:plot.l_wise}
\end{figure*}

Panel (A) of Figure~\ref{fig:plot.l_wise} shows the effect of using multiple layers on the attribution quality of \mymethod. The figure reports the average AUC values of AOPC and LOdds across datasets, measured under the MoRF setting.

The results in panel (A) of Figure~\ref{fig:plot.l_wise} indicate that the attribution quality improves as the number of layers increases, with the best performance achieved when all layers are used.
Specifically, using all layers yields a $\times 1.52$ improvement in AOPC and a $\times 3.05$ improvement in LOdds compared to using only the penultimate layer. The gains diminish once three or more layers are used, but both metrics continue to improve as further layers are added.

\subsection{The Effect of Multiple Contrasts}

Panel (B) of Figure~\ref{fig:plot.l_wise} illustrates the impact of increasing the number of references for multiple contrasts in \mymethod on attribution quality, measured by average AUC for AOPC and LOdds across datasets under the MoRF setting.

The AOPC metric improves sharply as the number of references increases from $0$ to $5$. Beyond $5$ references, it continues to increase more gradually and stabilizes between $25$ and $30$ references. Likewise, the LOdds metric (lower is better) drops sharply from approximately $0.30$, stabilizes around $0.10$ after $10$ references, and reaches its minimum at $30$ references.
These results indicate that more references improve attribution quality, with the best performance at $30$ references, which is the number we use in our experiments.

\subsection{The Effect of Contrasting References}

\begin{table}[tb]
\centering
\caption{Impact of the parameter $\gamma$ in the condition \(f_c(r)<\gamma\) on the attribution quality of \mymethod.}\label{tbl:gamma}
\begin{tabular}{c|c|c|c}\toprule
  $\gamma$ & 0.1 & 0.01 & 0.001 \\
\midrule

AOPC$\uparrow$ & 0.627 & 0.651 & 0.696  \\ 
LOdds$\downarrow$ & 0.450 & 0.448 & 0.127 \\ 

\bottomrule
\end{tabular}
\end{table} 

Table~\ref{tbl:gamma} presents the impact of the parameter $\gamma$ in the condition for selecting contrastive references, \(f_c(r)<\gamma\), on \mymethod's attribution quality.
This condition ensures that selected reference activations do not strongly respond to the target class $c$, thereby helping to reduce non-target signals within the target activation by contrasting it with the selected reference activations.
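
One way to write this selection rule, using symbols introduced here only for illustration ($\mathcal{D}$ for the pool of candidate references and $\mathcal{R}_c(\gamma)$ for the set selected for target class $c$), is
\begin{equation*}
\mathcal{R}_c(\gamma) = \{\, r \in \mathcal{D} : f_c(r) < \gamma \,\}.
\end{equation*}
Under this reading, $\mathcal{R}_c(\gamma') \subseteq \mathcal{R}_c(\gamma)$ whenever $\gamma' \le \gamma$, so lowering $\gamma$ can only shrink the selected set, keeping the references for which $f_c(r)$ is smallest.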

We evaluated \mymethod's faithfulness by varying $\gamma$ from $0.1$ to $0.001$, and report average AUC values for AOPC and LOdds across datasets under the MoRF setting.
The results in Table~\ref{tbl:gamma} indicate that a smaller $\gamma$ improves \mymethod's attribution quality: AOPC rises from $0.627$ to $0.696$ and LOdds falls from $0.450$ to $0.127$, with most of the LOdds reduction occurring between $\gamma = 0.01$ and $\gamma = 0.001$. This highlights the benefits of low-activation references for activation contrasting, as described in Section~\ref{sec:component_details}.

\section{Conclusion}

In this work, we introduced \mymethod, a novel activation-based attribution method that leverages activation contrasting to generate high-quality token-level attribution maps.
Our extensive experiments demonstrated that \mymethod significantly outperforms state-of-the-art methods across various datasets and models.

Despite its effectiveness, \mymethod requires reference points whose activations must be available when attribution maps are created.
While we minimized this overhead with a pre-built reference library, the library's storage requirements grow with the number of classes and the activation size. Future work will explore lower-cost alternatives to the stored reference tensors.

As the demand for interpretable AI grows to support safety, security, and trustworthiness, we believe \mymethod represents a meaningful step toward improving the transparency of transformer-based models.


\begin{acknowledgements}
This work was supported by the Institute of Information \& Communications Technology Planning \& Evaluation (IITP) grant funded by the Korea government (MSIT) (RS-2024-00439819, AI-Based Automated Vulnerability Detection and Safe Code Generation) and by the IITP-ITRC (Information Technology Research Center) grant funded by the Korea government (MSIT) (IITP-2025-RS-2020-II201749).
\end{acknowledgements}

% References
\bibliography{uai2025-template}

\end{document}
