Do LLM Self-Explanations Help Users Predict Model Behavior? Evaluating Counterfactual Simulatability with Pragmatic Perturbations
Keywords: Explainability, Self-explanations, Counterfactual simulatability
Abstract: Large Language Models (LLMs) can produce verbalized self-explanations, yet prior studies suggest that such rationales may not reliably reflect the model’s true decision process. We ask whether these explanations nevertheless help users predict model behavior, operationalized as *counterfactual simulatability*. Using StrategyQA, we evaluate how well humans and LLM judges can predict a model’s answers to counterfactual follow-up questions, with and without access to the model’s chain-of-thought or post-hoc explanations. We compare LLM-generated counterfactuals with counterfactuals produced via pragmatics-based perturbation strategies. Our results show that self-explanations consistently improve simulation accuracy for both LLM judges and humans, but the size and stability of these gains depend strongly on the perturbation strategy and the strength of the judge. A qualitative analysis of the free-text justifications that human users wrote while predicting the model’s behavior provides further evidence that access to explanations helps them form more accurate predictions on the perturbed questions.
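To make the operationalization concrete, below is a minimal, hypothetical Python sketch of how counterfactual simulatability might be scored: for each perturbed follow-up question, a judge’s predicted answer is compared against the model’s actual answer, and simulation accuracy is the match rate. The `CounterfactualCase` structure, the yes/no (boolean) answer format, and the toy data are illustrative assumptions, not the paper’s code.

```python
from dataclasses import dataclass

@dataclass
class CounterfactualCase:
    """One perturbed follow-up question derived from an original StrategyQA item."""
    question: str
    model_answer: bool      # the explained model's actual yes/no answer
    judge_prediction: bool  # what the judge (human or LLM) predicted it would say

def simulation_accuracy(cases: list[CounterfactualCase]) -> float:
    """Fraction of counterfactual questions where the judge's prediction
    matches the model's actual answer."""
    if not cases:
        raise ValueError("need at least one counterfactual case")
    correct = sum(c.judge_prediction == c.model_answer for c in cases)
    return correct / len(cases)

# Toy comparison: the same judge predicting without vs. with explanations.
without_expl = [
    CounterfactualCase("Could a llama win a marathon?", False, True),
    CounterfactualCase("Could a llama outpace a tortoise?", True, True),
]
with_expl = [
    CounterfactualCase("Could a llama win a marathon?", False, False),
    CounterfactualCase("Could a llama outpace a tortoise?", True, True),
]
print(f"no explanation:   {simulation_accuracy(without_expl):.2f}")
print(f"with explanation: {simulation_accuracy(with_expl):.2f}")
```

Under this framing, an explanation is useful to the extent that it raises simulation accuracy on the counterfactual set, independent of whether it faithfully reflects the model’s internal computation.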
Paper Type: Long
Research Area: Special Theme (conference specific)
Research Area Keywords: free-text/natural language explanations, explanation faithfulness
Contribution Types: Model analysis & interpretability, Data analysis
Languages Studied: English
Submission Number: 3749