Abstract: Explanations are an important tool for gaining insights into model behavior, calibrating user trust, and ensuring compliance.
The past few years have seen a flurry of methods for generating explanations, many of which involve computing model gradients or solving specially designed optimization problems.
Owing to the remarkable reasoning abilities of large language models (LLMs), _self-explanation_, i.e., prompting the model to explain its own outputs, has recently emerged as a new paradigm.
We study a specific type of self-explanation: _self-generated counterfactual explanations_ (SCEs).
We design tests for measuring the efficacy of LLMs in generating SCEs. Analysis over various LLM families, sizes, temperatures, and datasets reveals that LLMs often struggle to generate SCEs. When they do, their predictions often do not agree with their own counterfactual reasoning.
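The abstract's test can be read as a two-step check: ask the model to edit an input so that its own label would flip, then re-query the model on the edited input and see whether the prediction actually matches the intended label. The sketch below illustrates this idea under stated assumptions; it is not the paper's harness, and the `query_llm` helper and prompt wording are hypothetical.

```python
# Minimal sketch of an SCE faithfulness check.
# Assumes a hypothetical helper `query_llm(prompt: str) -> str` that returns the
# model's text response; prompts and label handling are illustrative only.

def generate_sce(query_llm, text: str, predicted_label: str, target_label: str) -> str:
    """Ask the model to minimally edit `text` so that it would assign `target_label`."""
    prompt = (
        f"You labeled the following text as '{predicted_label}':\n{text}\n\n"
        f"Edit the text minimally so that you would instead label it '{target_label}'. "
        "Return only the edited text."
    )
    return query_llm(prompt)


def is_faithful(query_llm, counterfactual: str, target_label: str) -> bool:
    """A counterfactual counts as faithful only if a fresh prediction on it
    matches the label the edit was supposed to induce."""
    prompt = (
        f"Label the following text:\n{counterfactual}\n"
        "Answer with the label only."
    )
    return query_llm(prompt).strip().lower() == target_label.lower()
```

Aggregating `is_faithful` over a dataset would give the kind of agreement rate the abstract refers to when it says the model's predictions often disagree with its own counterfactual reasoning.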
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: counterfactual/contrastive explanations, explanation faithfulness
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 7963