ICE: Intervention-Consistent Explanation Evaluation with Statistical Grounding for LLMs

ACL ARR 2026 January Submission 6583 Authors

05 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: explainability, faithfulness evaluation, feature attribution, large language models, statistical testing, multilingual NLP, interpretability
Abstract: Evaluating explanation faithfulness, i.e., whether explanations reflect a model's true reasoning, remains challenging. Benchmarks like ERASER rely on single intervention strategies without statistical rigor, making it hard to separate genuine faithfulness from noise. We introduce ICE (Intervention-Consistent Explanation), a framework that addresses these gaps through randomization tests with win-rate metrics and bootstrap confidence intervals. We evaluate 7 LLMs across 4 datasets and 4 languages using native sentiment data, comparing attention and gradient attribution. Key findings: (1) attention beats gradient attribution on short text (+10--20\%), but the two converge on long text; (2) faithfulness and human plausibility are orthogonal ($|r| < 0.04$), implying they must be evaluated independently; (3) NLI yields the highest faithfulness (Llama 3.1-8B: 97.2\% gradient win rate); (4) multilingual results vary widely: Qwen achieves an 82.7\% attention win rate on German, while GPT-2 shows anti-faithfulness on French (15\% win rate); (5) some configurations perform worse than random, a critical warning for practitioners. We release the ICE framework and benchmark to facilitate future research.
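
To make the statistical protocol named in the abstract concrete, the sketch below shows one way to compute a win rate for an attribution method against a random-intervention baseline, with a percentile bootstrap confidence interval over examples. It is a minimal sketch, not the authors' released implementation: the function names (`win_rate`, `bootstrap_ci`) and the assumption that each example yields a scalar faithfulness score for both the method and a randomized baseline are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def win_rate(method_scores, baseline_scores):
    """Fraction of examples where the method's per-example faithfulness
    score beats the random-attribution baseline; ties count as half."""
    m = np.asarray(method_scores, dtype=float)
    b = np.asarray(baseline_scores, dtype=float)
    wins = (m > b).astype(float) + 0.5 * (m == b)
    return wins.mean()

def bootstrap_ci(method_scores, baseline_scores, n_boot=10_000, alpha=0.05):
    """Percentile bootstrap CI for the win rate, resampling examples
    with replacement."""
    m = np.asarray(method_scores, dtype=float)
    b = np.asarray(baseline_scores, dtype=float)
    n = len(m)
    stats = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample example indices
        stats[i] = win_rate(m[idx], b[idx])
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi
```

Under this reading, a win rate whose confidence interval sits wholly above 0.5 indicates faithfulness beyond chance, while an interval wholly below 0.5 corresponds to the anti-faithful configurations the abstract warns about.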
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: explanation faithfulness, feature attribution
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Publicly available software and/or pre-trained models, Data analysis
Languages Studied: English, German, French, Hindi, Chinese
Submission Number: 6583