Abstract: Explanations are an important tool for gaining insights into model behavior, calibrating user trust, and ensuring compliance.
The past few years have seen a flurry of methods for generating explanations, many of which involve computing model gradients or solving specially designed optimization problems.
Owing to the remarkable reasoning abilities of large language models (LLMs), _self-explanation_, i.e., prompting the model to explain its own outputs, has recently emerged as a new paradigm.
We study a specific type of self-explanation: _self-generated counterfactual explanations_ (SCEs).
We design tests for measuring the efficacy of LLMs in generating SCEs. Analysis over various LLM families, sizes, temperatures, and datasets reveals that LLMs often struggle to generate SCEs. When they do, their predictions often do not agree with their own counterfactual reasoning.
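The abstract's test can be read as a two-step check: ask the model to edit an input so that its own label would flip, then re-query the model on the edited input and see whether the prediction actually matches the intended label. The sketch below illustrates this idea under stated assumptions; it is not the paper's harness, and the `query_llm` helper and prompt wording are hypothetical.

```python
# Minimal sketch of an SCE faithfulness check.
# Assumes a hypothetical helper `query_llm(prompt: str) -> str` that returns the
# model's text response; prompts and label handling are illustrative only.

def generate_sce(query_llm, text: str, predicted_label: str, target_label: str) -> str:
    """Ask the model to minimally edit `text` so that it would assign `target_label`."""
    prompt = (
        f"You labeled the following text as '{predicted_label}':\n{text}\n\n"
        f"Edit the text minimally so that you would instead label it '{target_label}'. "
        "Return only the edited text."
    )
    return query_llm(prompt)


def is_faithful(query_llm, counterfactual: str, target_label: str) -> bool:
    """A counterfactual counts as faithful only if a fresh prediction on it
    matches the label the edit was supposed to induce."""
    prompt = (
        f"Label the following text:\n{counterfactual}\n"
        "Answer with the label only."
    )
    return query_llm(prompt).strip().lower() == target_label.lower()
```

Aggregating `is_faithful` over a dataset would give the kind of agreement rate the abstract refers to when it says the model's predictions often disagree with its own counterfactual reasoning.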
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: counterfactual/contrastive explanations, explanation faithfulness
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 7963