Keywords: large language models, interpretability, self-explanations, functional consistency
TL;DR: We investigate the self-consistency of LLM self-explanations through sufficiency, comprehensiveness, and counterfactuality, revealing systematic differences between open- and closed-source models.
Abstract: Large Language Models (LLMs) have been widely adopted in text classification tasks, where they not only output class predictions but also generate explanations that highlight the tokens deemed most relevant to the predicted label. Yet it remains unclear whether these highlighted elements faithfully reflect the underlying decision process of the model. While much of the literature evaluates the textual plausibility of such explanations, few studies assess their functional consistency with the model’s actual behavior. In this work, we propose an experimental framework based on the principle of self-consistency: if a model identifies certain tokens as decisive, then isolating, removing, or semantically inverting them should produce systematic and interpretable changes in its predictions. We operationalize this evaluation through sufficiency, comprehensiveness, and counterfactuality metrics, and conduct experiments on IMDB and Steam reviews across both closed-source (GPT-4o) and open-source LLMs (Gemma3, Granite8B, DeepSeek). Results show that GPT-4o follows the expected progression across all metrics; Gemma3 and Granite8B maintain coherence under sufficiency but lose consistency under more demanding interventions; and DeepSeek variants display structural deviations, either failing to preserve sufficiency or overreacting under comprehensiveness and counterfactuality. These findings indicate that explanation reliability varies across LLM families and scales, with smaller models displaying contradictions and larger ones exhibiting over-sensitivity. By combining sufficiency, comprehensiveness, and counterfactuality, our approach provides a systematic methodology for assessing the functional consistency of LLM self-explanations.
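To make the three interventions concrete, the sketch below illustrates (under our own assumptions, not as the authors' implementation) how the self-consistency checks described in the abstract could be wired up: keep only the tokens the model flagged as decisive (sufficiency), delete them (comprehensiveness), or swap them for semantic opposites (counterfactuality), then compare the resulting predictions with the original one. The `classify` callable and the antonym map are hypothetical placeholders standing in for an LLM call and whatever inversion procedure the paper actually uses.

```python
# Minimal sketch of the intervention-based self-consistency checks.
# `classify` is a hypothetical callable mapping a text to a label;
# `tokens` are the tokens the model itself highlighted as decisive.

def keep_only(text: str, tokens: list[str]) -> str:
    """Sufficiency input: keep only the tokens the model called decisive."""
    keep = set(tokens)
    return " ".join(w for w in text.split() if w in keep)

def remove(text: str, tokens: list[str]) -> str:
    """Comprehensiveness input: delete the decisive tokens from the text."""
    drop = set(tokens)
    return " ".join(w for w in text.split() if w not in drop)

def invert(text: str, antonyms: dict[str, str]) -> str:
    """Counterfactual input: replace decisive tokens with semantic opposites
    (the antonym map is an assumption; the paper may build it differently)."""
    return " ".join(antonyms.get(w, w) for w in text.split())

def consistency_checks(classify, text, tokens, antonyms):
    original = classify(text)
    return {
        # Sufficiency: the decisive tokens alone should preserve the prediction.
        "sufficiency": classify(keep_only(text, tokens)) == original,
        # Comprehensiveness: removing them should change the prediction.
        "comprehensiveness": classify(remove(text, tokens)) != original,
        # Counterfactuality: inverting them should flip the prediction.
        "counterfactuality": classify(invert(text, antonyms)) != original,
    }

if __name__ == "__main__":
    # Toy classifier standing in for an LLM call, purely for illustration.
    toy = lambda t: "positive" if "great" in t else "negative"
    print(consistency_checks(
        toy,
        text="a great and memorable film",
        tokens=["great"],
        antonyms={"great": "terrible"},
    ))
```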
Primary Area: interpretability and explainable AI
Submission Number: 21430