Keywords: explanation faithfulness, natural language explanations, counterfactual data augmentation, activation steering, natural language inference
Abstract: Recent advances in explainable artificial intelligence have emphasized generating natural language explanations (NLEs) to justify model predictions. However, NLEs often fail to faithfully reflect a model’s underlying decision process, potentially misleading users and undermining trust in deployed systems. In this work, we aim to improve explanation faithfulness in the natural language inference (NLI) setting by automatically constructing a dataset of unfaithful explanations using counterfactual tests and leveraging it for activation-level steering. Starting from the e-SNLI dataset, we apply rule-based counterfactual edits that locally modify hypotheses and regenerate NLI labels and explanations for the edited premise–hypothesis pairs. Among cases where the predicted label changes, we identify unfaithful explanations as those that completely ignore the attribute introduced by the counterfactual edit. To reduce false positives from surface-level matching, we further introduce attribute-based semantic filtering. Using the resulting high-confidence unfaithful explanations, we compute steering vectors via Contrastive Activation Addition (CAA) and apply them during decoding to adjust the model’s internal representations toward greater causal alignment between predictions and explanations.
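The CAA step described above can be sketched in a few lines: the steering vector is the mean difference between hidden activations collected on contrastive (here, faithful vs. unfaithful) examples, and it is added to the hidden state during decoding. This is a minimal NumPy sketch of the general technique; the function names, the `alpha` scaling factor, and the toy activations are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def compute_caa_steering(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    """Contrastive Activation Addition (sketch): steering vector is the
    mean activation of positive (faithful) examples minus the mean
    activation of negative (unfaithful) examples at a chosen layer."""
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

def steer(hidden_state: np.ndarray, steering_vec: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Apply the steering vector to a hidden state at decode time.
    `alpha` (an assumed hyperparameter) scales the intervention strength."""
    return hidden_state + alpha * steering_vec

# Toy activations: 2-dimensional hidden states for illustration only.
pos = np.array([[1.0, 0.0], [1.0, 0.0]])  # activations from faithful explanations
neg = np.array([[0.0, 1.0], [0.0, 1.0]])  # activations from unfaithful explanations
vec = compute_caa_steering(pos, neg)       # direction pointing toward faithfulness
steered = steer(np.zeros(2), vec)          # shift a hidden state along that direction
```

In practice the activations would be taken from a specific transformer layer at explanation-generation time; the sketch only shows the vector arithmetic that CAA performs.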
Experimental results show consistent improvements in explanation faithfulness not only under the Adding Modifier (AM) rule but also across multiple counterfactual rules. Importantly, NLI prediction accuracy on in-distribution evaluation sets remains largely unchanged, indicating that the proposed method enhances explanation faithfulness without degrading predictive performance.
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: explanation faithfulness, counterfactual/contrastive explanations, free-text/natural language explanations
Contribution Types: Model analysis & interpretability, Data resources
Languages Studied: English
Submission Number: 7988