Keywords: explanation faithfulness, natural language explanations, counterfactual data augmentation, activation steering, natural language inference
Abstract: Recent advances in explainable artificial intelligence have emphasized generating natural language explanations (NLEs) to justify model predictions. However, NLEs often fail to faithfully reflect a model’s underlying decision process, potentially misleading users and undermining trust in deployed systems. In this work, we aim to improve explanation faithfulness in the natural language inference (NLI) setting by automatically constructing a dataset of unfaithful explanations using counterfactual tests and leveraging it for activation-level steering. Starting from the e-SNLI dataset, we apply rule-based counterfactual edits that locally modify hypotheses and regenerate NLI labels and explanations for the edited premise–hypothesis pairs. Among cases where the predicted label changes, we identify unfaithful explanations as those that completely ignore the attribute introduced by the counterfactual edit. To reduce false positives from surface-level matching, we further introduce attribute-based semantic filtering. Using the resulting high-confidence unfaithful explanations, we compute steering vectors via Contrastive Activation Addition (CAA) and apply them during decoding to adjust the model’s internal representations toward greater causal alignment between predictions and explanations.
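The CAA step described above can be sketched in a few lines: the steering vector is the mean difference between hidden activations collected on contrastive (here, faithful vs. unfaithful) examples, and it is added to the hidden state during decoding. This is a minimal NumPy sketch of the general technique; the function names, the `alpha` scaling factor, and the toy activations are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def compute_caa_steering(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    """Contrastive Activation Addition (sketch): steering vector is the
    mean activation of positive (faithful) examples minus the mean
    activation of negative (unfaithful) examples at a chosen layer."""
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

def steer(hidden_state: np.ndarray, steering_vec: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Apply the steering vector to a hidden state at decode time.
    `alpha` (an assumed hyperparameter) scales the intervention strength."""
    return hidden_state + alpha * steering_vec

# Toy activations: 2-dimensional hidden states for illustration only.
pos = np.array([[1.0, 0.0], [1.0, 0.0]])  # activations from faithful explanations
neg = np.array([[0.0, 1.0], [0.0, 1.0]])  # activations from unfaithful explanations
vec = compute_caa_steering(pos, neg)       # direction pointing toward faithfulness
steered = steer(np.zeros(2), vec)          # shift a hidden state along that direction
```

In practice the activations would be taken from a specific transformer layer at explanation-generation time; the sketch only shows the vector arithmetic that CAA performs.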
Experimental results show consistent improvements in explanation faithfulness not only under the Adding Modifier (AM) rule but also across multiple counterfactual rules. Importantly, NLI prediction accuracy on in-distribution evaluation sets remains largely unchanged, indicating that the proposed method enhances explanation faithfulness without degrading predictive performance.
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: explanation faithfulness, counterfactual/contrastive explanations, free-text/natural language explanations
Contribution Types: Model analysis & interpretability, Data resources
Languages Studied: English
Submission Number: 7988