Do LLM Self-Explanations Help Users Predict Model Behavior? Evaluating Counterfactual Simulatability with Pragmatic Perturbations
Keywords: Explainability, Self-explanations, Counterfactual simulatability
Abstract: Large Language Models (LLMs) can produce verbalized self-explanations, yet prior studies suggest that such rationales may not reliably reflect the model’s true decision process. We ask whether these explanations nevertheless help users predict model behavior, operationalized as *counterfactual simulatability*. Using StrategyQA, we evaluate how well humans and LLM judges can predict a model’s answers to counterfactual follow-up questions, with and without access to the model’s chain-of-thought or post-hoc explanations. We compare LLM-generated counterfactuals with counterfactuals produced via pragmatics-based perturbation strategies. Our results show that self-explanations consistently improve simulation accuracy for both LLM judges and humans, but the size and stability of these gains depend strongly on the perturbation strategy and the strength of the judge. A qualitative analysis of the free-text justifications that human users wrote while predicting the model’s behavior provides further evidence that access to explanations helps them form more accurate predictions on the perturbed questions.
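To make the operationalization concrete, below is a minimal, hypothetical Python sketch of how counterfactual simulatability might be scored: for each perturbed follow-up question, a judge’s predicted answer is compared against the model’s actual answer, and simulation accuracy is the match rate. The `CounterfactualCase` structure, the yes/no (boolean) answer format, and the toy data are illustrative assumptions, not the paper’s code.

```python
from dataclasses import dataclass

@dataclass
class CounterfactualCase:
    """One perturbed follow-up question derived from an original StrategyQA item."""
    question: str
    model_answer: bool      # the explained model's actual yes/no answer
    judge_prediction: bool  # what the judge (human or LLM) predicted it would say

def simulation_accuracy(cases: list[CounterfactualCase]) -> float:
    """Fraction of counterfactual questions where the judge's prediction
    matches the model's actual answer."""
    if not cases:
        raise ValueError("need at least one counterfactual case")
    correct = sum(c.judge_prediction == c.model_answer for c in cases)
    return correct / len(cases)

# Toy comparison: the same judge predicting without vs. with explanations.
without_expl = [
    CounterfactualCase("Could a llama win a marathon?", False, True),
    CounterfactualCase("Could a llama outpace a tortoise?", True, True),
]
with_expl = [
    CounterfactualCase("Could a llama win a marathon?", False, False),
    CounterfactualCase("Could a llama outpace a tortoise?", True, True),
]
print(f"no explanation:   {simulation_accuracy(without_expl):.2f}")
print(f"with explanation: {simulation_accuracy(with_expl):.2f}")
```

Under this framing, an explanation is useful to the extent that it raises simulation accuracy on the counterfactual set, independent of whether it faithfully reflects the model’s internal computation.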
Paper Type: Long
Research Area: Special Theme (conference specific)
Research Area Keywords: free-text/natural language explanations, explanation faithfulness
Contribution Types: Model analysis & interpretability, Data analysis
Languages Studied: English
Submission Number: 3749