Beyond Fooling: Model Manipulation Under Explanation-aware Training

ACL ARR 2026 January Submission4981 Authors

05 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: XAI, adversarial training, explanation-aware training, NLP
Abstract: Feature-level explanations are commonly used to interpret transformer-based NLP models. Still, little is known about how explanation-aware objectives influence model behaviour during training. While prior work has demonstrated training-time manipulation of explanations in vision models, its implications for transformers and token-level explainability remain unexplored. We study training-time manipulation of token-level explanations in transformer-based NLP classifiers and introduce sequence-aware objectives suited to text input. We show that explanation-aware training systematically alters token relevance patterns while largely preserving task accuracy. Importantly, masking and cross-method evaluations reveal that these attribution changes can coincide with shifts in model reliance rather than isolated failures of specific explanation methods. Our results suggest that apparent vulnerabilities of feature-level explanations can reflect deeper model adaptations, underscoring the need to consider learning dynamics when interpreting explanation robustness.
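The abstract describes explanation-aware training that steers token-level relevance while preserving task accuracy. The following is a minimal sketch of such an objective, not the authors' implementation: it combines a standard cross-entropy task loss with a differentiable penalty on gradient-times-input token attributions. The toy model, dimensions, and the target relevance pattern are illustrative assumptions.

```python
# Minimal sketch (not the paper's code): explanation-aware training that adds a
# token-attribution penalty to the task loss. All model/parameter choices here
# are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyTextClassifier(nn.Module):
    """Toy classifier: embedding -> mean pool -> linear head."""
    def __init__(self, vocab_size=1000, dim=32, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, token_ids):
        emb = self.embed(token_ids)          # (batch, seq, dim)
        logits = self.head(emb.mean(dim=1))  # (batch, num_classes)
        return logits, emb

def token_attributions(logits, emb, labels):
    """Gradient x input relevance per token, kept in the graph
    (create_graph=True) so the attribution penalty is differentiable."""
    score = logits.gather(1, labels.unsqueeze(1)).sum()
    grads = torch.autograd.grad(score, emb, create_graph=True)[0]
    return (grads * emb).sum(dim=-1)         # (batch, seq)

def explanation_aware_loss(model, token_ids, labels, target_rel, lam=1.0):
    """Task loss plus a penalty steering token relevance toward target_rel."""
    logits, emb = model(token_ids)
    task_loss = F.cross_entropy(logits, labels)
    rel = token_attributions(logits, emb, labels)
    expl_loss = F.mse_loss(rel, target_rel)
    return task_loss + lam * expl_loss

# Usage sketch: push relevance onto the first token only.
model = TinyTextClassifier()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
tokens = torch.randint(0, 1000, (4, 16))
labels = torch.randint(0, 2, (4,))
target = torch.zeros(4, 16)
target[:, 0] = 1.0
loss = explanation_aware_loss(model, tokens, labels, target)
loss.backward()
opt.step()
```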
Paper Type: Short
Research Area: Special Theme (conference specific)
Research Area Keywords: XAI, adversarial machine learning, adversarial model manipulations
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Submission Number: 4981