Source Latex Files: zip
PDF File: pdf
License To Publish: pdf
Keywords: Failure prediction, Interpretability, Benchmarking, AI governance
Corresponding Author Name: Joel Mathew
TL;DR: We introduce the Why It Failed benchmark to evaluate interpretability methods on model failure prediction
Corresponding Author Email: joel.mathew@sjsu.edu
Abstract: We introduce *Why It Failed*, a standardized benchmark for evaluating whether interpretability methods can explain model failures. We test last-token logistic probes on Gemma-2 2B across four basic reasoning tasks and find that they fail to predict model failures, achieving near-chance accuracy on all tasks. Finally, we urge the AI community to move beyond reporting quantitative metrics and toward explaining when and why models fail.
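The probing setup named in the abstract can be sketched as follows. This is a hedged illustration, not the paper's code: the activations, labels, and feature dimension below are synthetic stand-ins (a real run would extract Gemma-2 2B last-token hidden states and label each example by whether the model answered correctly).

```python
# Sketch of a last-token logistic probe for failure prediction.
# Assumption: features are per-example last-token hidden states;
# here we use random stand-ins rather than real Gemma-2 2B activations.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-in activations: 500 examples, 256-dim features
# (Gemma-2 2B's actual hidden size is larger).
X = rng.normal(size=(500, 256))
# Stand-in labels: 1 = model answered correctly, 0 = model failed.
y = rng.integers(0, 2, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0
)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# With uninformative features, accuracy hovers near chance,
# the baseline against which probe performance is judged.
acc = probe.score(X_te, y_te)
print(f"probe accuracy: {acc:.2f}")
```

In the benchmark, a probe "explains" failures only if its held-out accuracy clears this near-chance baseline.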
Submission Number: 33