Source Latex Files: zip
PDF File: pdf
License To Publish: pdf
Keywords: Failure prediction, Interpretability, Benchmarking, AI governance
Corresponding Author Name: Joel Mathew
TL;DR: We introduce the Why It Failed benchmark to evaluate interpretability methods on model failure prediction
Corresponding Author Email: joel.mathew@sjsu.edu
Abstract: We introduce *Why It Failed*, a standardized benchmark for evaluating whether interpretability methods can explain model failures. We test last-token logistic probes on Gemma-2 2B across four basic reasoning tasks and find that they fail to predict model failures, achieving near-chance accuracy on all tasks. Finally, we urge the AI community to move beyond reporting quantitative metrics and toward explaining when and why models fail.
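The probing setup named in the abstract can be sketched as follows. This is a hedged illustration, not the paper's code: the activations, labels, and feature dimension below are synthetic stand-ins (a real run would extract Gemma-2 2B last-token hidden states and label each example by whether the model answered correctly).

```python
# Sketch of a last-token logistic probe for failure prediction.
# Assumption: features are per-example last-token hidden states;
# here we use random stand-ins rather than real Gemma-2 2B activations.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-in activations: 500 examples, 256-dim features
# (Gemma-2 2B's actual hidden size is larger).
X = rng.normal(size=(500, 256))
# Stand-in labels: 1 = model answered correctly, 0 = model failed.
y = rng.integers(0, 2, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0
)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# With uninformative features, accuracy hovers near chance,
# the baseline against which probe performance is judged.
acc = probe.score(X_te, y_te)
print(f"probe accuracy: {acc:.2f}")
```

In the benchmark, a probe "explains" failures only if its held-out accuracy clears this near-chance baseline.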
Submission Number: 33