% Springer LNCS template
\documentclass[runningheads]{llncs}

\usepackage{graphicx}
\usepackage{amsmath}
\usepackage{booktabs}
\usepackage{tabularx}
\usepackage[T1]{fontenc}
\usepackage{url}
% Corresponding author marked with *

% For algorithms
\usepackage{algorithm}
\usepackage{algorithmic}

\begin{document}

\title{Why It Failed: A Benchmark to Evaluate Interpretability\thanks{Code available at \url{https://github.com/anthonytang/LLM-Detection-Testing}}}

% Springer LNCS author format
\author{
Joel Mathew\inst{1}\orcidID{0009-0009-1981-7073} \and
Aditya Lagu\inst{1} \and
Anthony Tang\inst{1} \and
Prudhviraj Naidu\inst{1,2}
}

\authorrunning{J. Mathew et al.}

\institute{
Algoverse AI Research \\
\email{joel.mathew@sjsu.edu} \and
University of California San Diego
}

\maketitle

\begin{abstract}
We introduce \textit{Why It Failed}, a benchmark for evaluating whether interpretability methods can explain model failures. As a baseline, we test last-token logistic regression probes on Gemma-2 2B across four basic reasoning tasks and find that they fail to predict model failures, achieving near-chance performance on every task apart from a slight improvement at later layers on BoolQ. The benchmark provides a standardized framework for comparing other interpretability methods on the same question. Finally, we encourage the AI community to move beyond reporting quantitative metrics and to seek explanations of when and why models fail.

\keywords{Interpretability \and Benchmark \and Language Models \and Probing \and Model Failures}
\end{abstract}

%------------------------------------------------
\section{Introduction}
\label{sec:introduction}

Large language models achieve impressive performance on diverse benchmarks, yet they still fail in puzzling and unpredictable ways. A model might score 80\% on a commonsense reasoning task, but this aggregate metric obscures crucial questions: \textit{When} does the model fail? \textit{Why} does it fail? Are failures random noise, or do they follow systematic patterns? Current evaluation practices report quantitative metrics without explaining the underlying failure modes, leaving practitioners uncertain about when and where models can be safely deployed.

The interpretability community has made significant progress in understanding what models know, for example by identifying circuits that implement specific algorithms~\cite{wang2022interpretabilitywildcircuitindirect} and discovering human-interpretable concepts in models' activations~\cite{bricken2023monosemanticity}. However, interpretability methods are typically validated on their ability to detect unwanted behaviors and monitor models in deployment~\cite{mckenzie2025detectinghighstakesinteractionsactivation}, rather than on their ability to systematically explain common failure cases of models on standard benchmarks.

This gap has important consequences. Without systematic explanations of failures, we cannot make informed deployment decisions. Explanations enable us to distinguish model-specific quirks from dataset artifacts, and to understand whether failure patterns generalize across model families or emerge from particular architectures. Most fundamentally, explaining failures allows us to move beyond simply reporting benchmark scores toward building scientific theories of model limitations.

We propose \textit{Why It Failed}, a benchmark for evaluating whether interpretability methods can explain model failures. Our contributions are twofold:

\begin{enumerate}
    \item We introduce a benchmark framework that shifts focus from explaining what models know to explaining what they don't know.
    \item We show that standard last-token probing is insufficient for predicting failures.
\end{enumerate}

We invite the interpretability community to test their methods against this benchmark and push toward explanations that illuminate model limitations.

The rest of the paper is organized as follows. Section~\ref{sec:related-work} reviews related work in benchmarking explanations, interpretability methods and generating explanations using SAEs. Section~\ref{sec:why-it-failed-benchmark} describes our benchmark framework and how we generated our benchmark examples. Section~\ref{sec:probe-experiments} describes our probing experiment methodology and discusses our results. In Section~\ref{sec:discussion}, we discuss limitations of our work and provide questions for future work.

%------------------------------------------------
\section{Related Work}
\label{sec:related-work}

\subsection{Benchmarking Interpretability Methods}
\label{subsec:benchmarking-interpretability}

Mills et al.~\cite{mills2025almanacssimulatabilitybenchmarklanguage} propose ALMANACS, a benchmark that tests how well explanations help predict model behaviour on new inputs. They focus on behaviour related to AI safety, such as ethical reasoning, self-preservation, or responses to harmful requests, and construct safety scenarios designed to elicit specific model behaviour. Explanations are generated primarily through black-box or attribution methods.

In contrast, our approach focuses on explaining model failures on standard benchmarks. We select established reasoning benchmarks that measure basic model capabilities. We then establish a baseline by probing models' internal activations, an approach that has been successful in recognizing diverse model behaviours and steering models~\cite{beaglehole2025universalsteeringmonitoringai,mckenzie2025detectinghighstakesinteractionsactivation}.

Chandrasekaran et al.~\cite{chandrasekaran-etal-2018-explanations} study a failure prediction task for a visual question answering (VQA) model, in which humans are asked to predict whether the VQA model will answer a question correctly. They find that humans can predict model failures above chance, but that explanation methods do not improve human performance. In contrast, we focus on pure language models and first ask how accurately model failures can be predicted in the first place.

\subsection{Interpretability Methods for Language Models}
\label{subsec:interpretability-methods}

Recent interpretability research has focused on three directions: analyzing the reasoning traces that models produce (CoT interpretability), training probes on models' internal representations, and decomposing those representations with sparse autoencoders (SAEs). These methods differ in which aspect of the model they study, but all have been used primarily for monitoring and steering models.

\textbf{Chain-of-Thought (CoT) Interpretability} analyzes the reasoning traces that a language model produces before its final response. Baker et al.~\cite{baker2025monitoringreasoningmodelsmisbehavior} detect reward hacking in models by monitoring their reasoning traces. Korbak et al.~\cite{korbak2025chainthoughtmonitorabilitynew} argue that CoT interpretability is a promising research direction; however, CoTs can lose their usefulness for monitoring when they are directly subjected to optimization pressure during training. Kirchner et al.~\cite{kirchner2024proververifiergamesimprovelegibility} propose a scheme to make models' reasoning more legible by posing the task as a prover-verifier game, in which the model (as a prover) has to provide legible explanations for its answer that allow a verifier to predict whether the model answered correctly.

\textbf{Probing methods} train auxiliary classifiers on a base model's hidden states to detect concepts that the base model was not explicitly trained for~\cite{pmlr-v119-chen20s}. Subsequent work has shown linear probing can detect unsafe behaviour and steer the model towards a desirable behaviour~\cite{beaglehole2025universalsteeringmonitoringai,mckenzie2025detectinghighstakesinteractionsactivation}.

\textbf{Sparse Autoencoders} learn overcomplete, sparse decompositions of neural activations, hypothesizing that individual SAE features correspond to monosemantic concepts~\cite{cunningham2023sparseautoencodershighlyinterpretable,bricken2023monosemanticity}. Jiang et al.~\cite{jiang2025towards} use SAEs to generate explanations for differences in datasets. For GSM8K, they show that problems described as ``Math word problems involving time, distance, and speed'' have lower accuracy.

Our benchmark evaluates whether these interpretability methods can be repurposed to identify systematic patterns that distinguish successes from failures.

%------------------------------------------------
\section{Why It Failed Benchmark}
\label{sec:why-it-failed-benchmark}

\subsection{Benchmark Overview}

\textbf{Core Question:} Can current interpretability methods explain why models fail? When a model makes errors on a benchmark, we want to know whether interpretability techniques can surface explanations that genuinely capture the underlying causes of these failures.

\textbf{What constitutes a good failure explanation?} We argue that a good explanation must satisfy three criteria:

\begin{enumerate}
    \item \textbf{Faithful}: Given the explanation, we should be able to predict whether the model will fail on unseen inputs or when deployed in new settings.

    \item \textbf{Causal}: We should be able to manipulate model performance based on the explanation, either by generating adversarial inputs that exploit the identified weakness or by improving model performance (for example, by identifying issues with the training data or methodology).

    \item \textbf{Human-interpretable}: A human examining the explanation should be able to predict whether the model will fail on new inputs, without needing to run the model or inspect its internals.
\end{enumerate}

In this work, we focus on \textbf{faithfulness} as a measurable, necessary (though not sufficient) condition for good explanations. While causal interventions and human-interpretability are important criteria, we leave their systematic evaluation to future work.

\textbf{Evaluation Pipeline:} For each task, we construct a balanced dataset of $k$ success cases and $k$ failure cases. An \textbf{Explainer} (e.g., linear probe, chain-of-thought interpreter, SAEs) processes these $2k$ training instances to generate explanations. A \textbf{Predictor} then uses these explanations to classify whether the model will succeed or fail on new, unseen test instances. We measure the Predictor's performance using AUC on the held-out test set.
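
One way to formalize this pipeline is sketched below. The symbols $\mathcal{D}_{\text{train}}$, $E$, $P$, $e$, $\hat{s}$, and $h$ are illustrative notation introduced only for this sketch and are not part of a fixed specification. Let $y_i = 1$ if the model answers input $x_i$ correctly and $y_i = 0$ otherwise. Then
\begin{align}
e &= E\left(\mathcal{D}_{\text{train}}\right), \qquad \mathcal{D}_{\text{train}} = \{(x_i, y_i)\}_{i=1}^{2k}, \\
\hat{s}(x) &= P(e, x) \in \mathbb{R},
\end{align}
where $e$ is the explanation produced by the Explainer $E$ and $\hat{s}(x)$ is the Predictor's score that the model succeeds on a new input $x$. Faithfulness is then measured as the AUC of $\hat{s}$ against the true labels on the held-out test set. For instance, if the Explainer is a linear probe, the explanation could be the pair $e = (\mathbf{w}, b)$ and the Predictor could be $P(e, x) = \sigma(\mathbf{w} \cdot h(x) + b)$, where $h(x)$ denotes an activation vector extracted from the model on input $x$.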

\subsection{Task and Model Selection}

\textbf{Tasks:} Our benchmark is constructed from four diverse reasoning tasks from standard LLM evaluation suites: PIQA (physical commonsense reasoning), BoolQ (yes/no reading comprehension), WinoGrande (commonsense coreference resolution), and Social IQa (social commonsense reasoning). These tasks span a variety of cognitive capabilities:

\begin{itemize}
    \item \textbf{PIQA}~\cite{Bisk2020}: Given a goal and two potential solutions, the model must select which action would successfully achieve the goal. For example: \textit{``How do I ready a guinea pig cage for its new occupants?''} with options involving paper strips vs.\ jeans material as bedding. This tests physical commonsense about everyday object interactions.

    \item \textbf{BoolQ}~\cite{clark2019boolq}: Given a passage and a yes/no question, the model must determine the correct boolean answer. For example: \textit{``Does ethanol take more energy to make than it produces?''} paired with a technical passage. This tests reading comprehension and factual reasoning.

    \item \textbf{WinoGrande}~\cite{ai2:winogrande}: A fill-in-the-blank coreference resolution task in which the model must determine which of two candidate nouns fills the blank. For example: \textit{``John moved the couch from the garage to the backyard to create space. The \_ is small.''} (options: garage, backyard). This tests commonsense reasoning about spatial and physical constraints.

    \item \textbf{Social IQa}~\cite{sap-etal-2019-social}: Questions about people's actions, intentions, and social implications. For example: \textit{``Sydney walked past a homeless woman asking for change but did not have any money. Sydney felt bad afterwards. How would you describe Sydney?''} This tests social and emotional reasoning.
\end{itemize}

\textbf{Why these tasks?} These tasks represent fundamental capabilities required for real-world language understanding: physical intuition, reading comprehension, linguistic reasoning, and social awareness. Critically, although they are simple multiple-choice or boolean questions, models can fail on them for complex, non-obvious reasons. For example, a model might fail on PIQA not because it lacks physical commonsense but because of vocabulary gaps or unfamiliar cultural contexts. The tasks also span a diverse range of contexts and are widely used in the LLM community to measure model capability and to inform deployment decisions.

\textbf{Model:} We use Gemma-2 2B~\cite{gemmateam2024gemma2improvingopen}, a 2-billion-parameter transformer model developed by Google. We chose Gemma-2 2B because: (1) it is small enough to run efficiently on consumer hardware, enabling rapid iteration and making our benchmark accessible to researchers without extensive computational resources; (2) it is a practical, industry-grade model designed for real-world deployment rather than a synthetic research model, which makes our findings more likely to transfer to models used in practice.

\textbf{Data Collection:} We construct our benchmark by collecting model responses on these four tasks and categorizing them into correct and incorrect predictions. For each task, we randomly sample from the evaluation set until we collect exactly $k$ correct predictions and $k$ incorrect predictions, creating a balanced dataset. All examples retain their standard multiple-choice or boolean format from the original tasks.
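
Algorithm~\ref{alg:balanced-sampling} sketches this collection procedure. It is an illustrative reading of the description above rather than a transcript of our implementation: the notation $M(x)$ for the model's answer, sampling without replacement, and the choice to discard surplus examples of whichever class fills up first are all assumptions made for the sketch. It also assumes that the evaluation set contains at least $k$ examples of each kind, since otherwise the loop would not terminate.

\begin{algorithm}[h]
\caption{Balanced sampling of success and failure cases (illustrative sketch)}
\label{alg:balanced-sampling}
\begin{algorithmic}
\REQUIRE evaluation set $\mathcal{D}$ of (prompt, gold answer) pairs, model $M$, target count $k$
\ENSURE success set $S$ and failure set $F$ with $|S| = |F| = k$
\STATE $S \leftarrow \emptyset$, $F \leftarrow \emptyset$
\WHILE{$|S| < k$ or $|F| < k$}
\STATE sample $(x, a)$ uniformly at random from $\mathcal{D}$ without replacement
\IF{$M(x) = a$ and $|S| < k$}
\STATE $S \leftarrow S \cup \{x\}$
\ELSIF{$M(x) \neq a$ and $|F| < k$}
\STATE $F \leftarrow F \cup \{x\}$
\ENDIF
\ENDWHILE
\RETURN $S, F$
\end{algorithmic}
\end{algorithm}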

%------------------------------------------------
\section{Probes Fail to Predict Model Failures}
\label{sec:probe-experiments}

\subsection{Experimental Setup}
\label{subsec:experimental-setup}

\subsubsection{Transformer Architecture and Residual Stream.}
Consider a transformer model with $L$ layers. For each layer $l \in \{1, 2, \ldots, L\}$ and token position $t$, we define the layer computation as:
\begin{align}
\mathbf{x}_{\text{mid}}^{(l,t)} &= \mathbf{x}_{\text{pre}}^{(l,t)} + \sum_{\text{head } h} \text{attn}^{(l,h)}\left(\mathbf{x}_{\text{pre}}^{(l,t)}, \mathbf{x}_{\text{pre}}^{(l,1:t)}\right) \label{eq:mid-residual}\\
\mathbf{x}_{\text{post}}^{(l,t)} &= \mathbf{x}_{\text{mid}}^{(l,t)} + \text{MLP}^{(l)}\left(\mathbf{x}_{\text{mid}}^{(l,t)}\right) \label{eq:post-residual}
\end{align}
where $\mathbf{x}_{\text{pre}}^{(l,t)} \in \mathbb{R}^d$ is the input to layer $l$ at position $t$ (the pre-residual stream), $d$ is the transformer model dimension, $\mathbf{x}_{\text{mid}}^{(l,t)} \in \mathbb{R}^d$ is the mid-residual stream (after attention), $\mathbf{x}_{\text{post}}^{(l,t)} \in \mathbb{R}^d$ is the output of layer $l$ (post-residual stream), $\text{attn}^{(l,h)}$ denotes the $h$-th attention head in layer $l$, and $\text{MLP}^{(l)}$ denotes the feedforward network in layer $l$.
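
A consequence of Equations~\ref{eq:mid-residual} and~\ref{eq:post-residual} is worth making explicit. Suppose, as is standard for residual architectures, that the output of one layer is the input to the next, i.e., $\mathbf{x}_{\text{pre}}^{(l+1,t)} = \mathbf{x}_{\text{post}}^{(l,t)}$. This identity, like the omission of normalization layers, is a simplifying assumption of the sketch and is not stated in the definitions above. Unrolling the recursion then gives
\begin{equation}
\mathbf{x}_{\text{post}}^{(l,t)} = \mathbf{x}_{\text{pre}}^{(1,t)} + \sum_{l'=1}^{l} \left[ \sum_{\text{head } h} \text{attn}^{(l',h)}\left(\mathbf{x}_{\text{pre}}^{(l',t)}, \mathbf{x}_{\text{pre}}^{(l',1:t)}\right) + \text{MLP}^{(l')}\left(\mathbf{x}_{\text{mid}}^{(l',t)}\right) \right].
\end{equation}
Under this assumption, a probe that reads the post-residual stream at layer $l$ sees the input to the first layer plus the accumulated attention and MLP outputs of layers $1$ through $l$, so probes at deeper layers have access to the results of strictly more computation.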

\subsubsection{Probe Training.}
We train linear probes on the post-residual stream $\mathbf{x}_{\text{post}}^{(l,t)}$ at the final token position of each prompt. Given a prompt from task $\mathcal{T}$ with $n$ tokens, we extract the activations $\mathbf{x}_{\text{post}}^{(l,n)}$ at layers $l \in \{5, 10, 15, 20, 25\}$ of Gemma-2 2B. These layers are evenly spaced so that the probes sample representations throughout the model's depth.

For each layer $l$ and task $\mathcal{T}$, we train a binary logistic regression classifier:
\begin{equation}
\hat{y} = \sigma\left(\mathbf{w}^{(l)} \cdot \mathbf{x}_{\text{post}}^{(l,n)} + b^{(l)}\right)
\end{equation}
where $\mathbf{w}^{(l)} \in \mathbb{R}^d$ is the probe direction, $b^{(l)} \in \mathbb{R}$ is a bias term, $\sigma$ is the sigmoid function, and $\hat{y}$ is the predicted probability that the model succeeds. The probe is trained to predict whether the model will succeed ($y=1$) or fail ($y=0$) on the given prompt. We use scikit-learn's logistic regression implementation with default hyperparameters.
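
For concreteness, the training objective implied by these defaults can be written out. To our understanding, scikit-learn's logistic regression defaults to an $\ell_2$ penalty with inverse regularization strength $C = 1$ and an unpenalized intercept; these defaults may differ between library versions, so the following should be read as a sketch under that assumption. Writing $\mathbf{w}$ and $b$ for the probe parameters at a given layer, $\mathbf{x}_i$ for the last-token activation of the $i$-th training prompt, and $\hat{y}_i = \sigma(\mathbf{w} \cdot \mathbf{x}_i + b)$, the probe minimizes
\begin{equation}
\min_{\mathbf{w},\, b} \;\; \frac{1}{2} \lVert \mathbf{w} \rVert_2^2 \; + \; C \sum_{i=1}^{N} \Big[ -y_i \log \hat{y}_i - (1 - y_i) \log\left(1 - \hat{y}_i\right) \Big],
\end{equation}
where $N$ is the number of training examples. The first term is the regularizer and the second is the usual binary cross-entropy loss, so each probe is a regularized linear classifier rather than an unconstrained maximum-likelihood fit.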

\subsubsection{Data Splits.}
For each task, we partition our 4,000 examples (2,000 successes and 2,000 failures) into:
\begin{itemize}
    \item \textbf{Training set}: 2,000 examples (1,000 successes, 1,000 failures)
    \item \textbf{Validation set}: 1,000 examples (500 successes, 500 failures)
    \item \textbf{Test set}: 1,000 examples (500 successes, 500 failures)
\end{itemize}
We report test set performance using Area Under the ROC Curve (AUC) as our primary metric. An AUC of 0.5 indicates chance-level performance, while 1.0 indicates perfect classification.
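
The AUC has a simple pairwise interpretation, which we spell out here using notation introduced only for this purpose. Let $\hat{s}_i$ be the predicted success score for test example $i$ (for a probe, its output probability), and let $n_{+}$ and $n_{-}$ be the numbers of success and failure cases in the test set. Then
\begin{equation}
\text{AUC} = \frac{1}{n_{+}\, n_{-}} \sum_{i:\, y_i = 1} \; \sum_{j:\, y_j = 0} \left( \mathbf{1}\left[\hat{s}_i > \hat{s}_j\right] + \frac{1}{2}\, \mathbf{1}\left[\hat{s}_i = \hat{s}_j\right] \right),
\end{equation}
i.e., the fraction of (success, failure) pairs in which the success case receives the higher score, with ties counted as one half. With $n_{+} = n_{-} = 500$, this averages over $500 \times 500 = 250{,}000$ pairs. Because the metric depends only on the ranking of the scores, it does not depend on the choice of a classification threshold.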

\subsection{Results}
\label{subsec:probe-results}

Figure~\ref{fig:probe-performance-across-tasks} shows the test set AUC for linear probes trained at different layers across all four tasks. The results reveal a striking failure: \textbf{with the exception of later layers on BoolQ, linear probes do not outperform random chance at predicting model failures}.

\begin{figure}[t]
\centering
\includegraphics[width=\textwidth]{probe_faithfulness.png}
\caption{Last-token probes fail to predict model failures. Performance stays near chance at all layers, except on BoolQ, where later layers slightly outperform random chance. For each task, we plot the test ROC AUC of a probe trained on the output of each layer. Error bars show 99.9\% confidence intervals. The dotted red line marks random-chance performance.}
\label{fig:probe-performance-across-tasks}
\end{figure}

Across all tasks and layers, probe performance clusters near the random baseline (AUC = 0.5, marked by the red line in Figure~\ref{fig:probe-performance-across-tasks}). The only exception is BoolQ, where probes at later layers perform slightly above chance.
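
To give a sense of scale for ``near chance'', the following back-of-the-envelope calculation may help. It is a sketch under stated assumptions and is not a description of how the error bars in Figure~\ref{fig:probe-performance-across-tasks} were computed. Suppose the probe's scores are continuous (no ties) and independent of the labels. The AUC, which is the fraction of (success, failure) test pairs ranked correctly, then has mean $1/2$ and variance
\begin{equation}
\text{Var}(\text{AUC}) = \frac{n_{+} + n_{-} + 1}{12\, n_{+}\, n_{-}} = \frac{1001}{12 \times 500 \times 500} \approx 3.3 \times 10^{-4},
\end{equation}
where $n_{+} = n_{-} = 500$ are the numbers of success and failure cases in the test set. The corresponding standard deviation is approximately $0.018$. Under a normal approximation, a two-sided 99.9\% interval uses the quantile $z \approx 3.29$, giving a half-width of about $3.29 \times 0.018 \approx 0.06$. Under these assumptions, a test AUC between roughly $0.44$ and $0.56$ would be statistically indistinguishable from chance at this level.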

We observe several consistent patterns:
\begin{enumerate}
    \item \textbf{No layer captures failure modes}: Performance remains near chance at all five probed layers, suggesting that the distinction between success and failure is not linearly encoded in the last-token residual stream at any of these layers.

    \item \textbf{Task variation is minimal}: All tasks exhibit essentially the same pattern of near-chance performance, except for probes trained on later layers on BoolQ. This suggests the limitation is not task-specific but reflects a broader issue with using last-token linear probes as explainers.
\end{enumerate}

These results show that last-token linear probes, despite the success of linear probes in identifying semantically meaningful directions in prior work, fail as explainers in our benchmark. At the layers we probe, the last-token residual stream does not appear to contain linearly accessible information sufficient to predict when the model will fail.

%------------------------------------------------
\section{Discussion and Conclusion}
\label{sec:discussion}

We introduce \textit{Why It Failed}, a benchmark for evaluating whether interpretability methods can explain model failures. Our framework operationalizes faithful explanations through predictive power: good explanations should enable prediction of failures on unseen inputs. While we focus on faithfulness in this work, our framework naturally extends to causal manipulation and human-interpretability criteria.

We showed that simple last-token probing fails to explain model failures. It remains unclear why probes slightly outperformed random chance on BoolQ. Our work can easily be extended to include:

\begin{itemize}
    \item \textbf{Richer probing methods}: Mean-token probing across sequences, attention probes~\cite{mckenzie2025detectinghighstakesinteractionsactivation}
    \item \textbf{Sparse representation methods}: Sparse autoencoders (SAEs) can identify clusters and offer explanations of why models fail~\cite{jiang2025towards}
    \item \textbf{Chain-of-thought interpretability}: Analyzing reasoning traces in models that produce intermediate steps. For these tasks, we did not generate reasoning traces from Gemma-2 2B. Future work would need to first generate examples of success and failure with reasoning traces.
\end{itemize}

Furthermore, we currently measure only faithfulness through predictive power. Future work should incorporate:
\begin{itemize}
    \item \textbf{Causal metrics}: Can explanations generate adversarial examples that flip model predictions? Can they guide interventions that improve performance?
    \item \textbf{Human-interpretability scoring}: Auto-interpretability methods~\cite{paulo2025automaticallyinterpretingmillionsfeatures} can generate explanations for SAE features. Kantamneni et al.~\cite{kantamneni2025sparseautoencodersusefulcase} propose a similar framework that could be used to explain logistic probes. However, it remains an open question whether these methods provide faithful human-interpretable explanations.
\end{itemize}

\subsection{Conclusion}

The \textit{Why It Failed} benchmark asks what models don't know and why. Our key message is simple: \textbf{we should aim to explain benchmark failures, not just report accuracy numbers}. This benchmark provides a concrete framework for evaluating whether interpretability methods achieve this goal. While linear probes fall short, they represent just the beginning. We invite the community to test their methods against this benchmark and push toward explanations that truly illuminate model limitations.

%------------------------------------------------
%------------------------------------------------
\begin{credits}
\subsubsection{\discintname}
The authors have no competing interests to declare that are relevant to the content of this article.
\end{credits}

%------------------------------------------------
% References
\bibliographystyle{splncs04}
\bibliography{springer_refs}

\end{document}
