Keywords: LLMs, explanations, faithfulness
TL;DR: We study whether LLM self-explanations are faithful by analyzing counterfactual tests, theoretically and empirically.
Abstract: When asked to explain their decisions, large language models (LLMs) can often give explanations that sound plausible to humans. But are these explanations faithful, i.e., do they convey the factors actually responsible for the decision? In this work, we analyze counterfactual faithfulness across 75 models from 13 families. We analyze the tradeoff between conciseness and comprehensiveness, how correlational faithfulness metrics assess this tradeoff, and the extent to which metrics can be gamed. This analysis motivates two new metrics: the phi-CCT, a simplified variant of the Correlational Counterfactual Test (CCT) that avoids the need for token probabilities while explaining most of the variance of the original test; and F-AUROC, which eliminates sensitivity to imbalanced intervention distributions and captures a model's ability to produce explanations with different levels of detail. Our findings reveal a clear scaling trend: larger and more capable models are consistently more faithful on all metrics we consider. We release our code for reproducibility.
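The abstract does not spell out how the phi-CCT is computed; the sketch below is a purely illustrative guess, assuming it reduces to the phi coefficient between two binary per-example indicators: whether a counterfactual intervention flips the model's answer, and whether the self-explanation mentions the intervened factor. The indicator names and this exact formulation are assumptions, not taken from the paper.

```python
import numpy as np

def phi_coefficient(x, y):
    """Phi coefficient: Pearson correlation between two binary variables."""
    x = np.asarray(x, dtype=int)
    y = np.asarray(y, dtype=int)
    n11 = np.sum((x == 1) & (y == 1))
    n10 = np.sum((x == 1) & (y == 0))
    n01 = np.sum((x == 0) & (y == 1))
    n00 = np.sum((x == 0) & (y == 0))
    denom = np.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return float(n11 * n00 - n10 * n01) / denom if denom > 0 else 0.0

# Hypothetical per-example indicators (illustrative data only):
# flipped[i]   -> did the counterfactual intervention change the model's answer?
# mentioned[i] -> does the self-explanation reference the intervened factor?
flipped   = [1, 1, 0, 0, 1, 0, 1, 0]
mentioned = [1, 1, 0, 1, 1, 0, 0, 0]

print(f"Illustrative phi-CCT-style score: {phi_coefficient(flipped, mentioned):.3f}")
```

Unlike a log-odds-based CCT, such a binary formulation needs only the model's final answers, which is consistent with the abstract's claim that the phi-CCT avoids token probabilities; consult the paper and released code for the actual definition.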
Supplementary Material: zip
Primary Area: interpretability and explainable AI
Submission Number: 24737