Abstract: The attribution of input features to the outputs of neural network models is an active area of research. While numerous explainability techniques have been proposed to interpret these models, their systematic and automated evaluation in sequence-to-sequence models remains underexplored.
This paper introduces a novel approach for evaluating explainability methods in transformer-based seq2seq models, building on forward simulation of XAI methods. Our method transfers learned knowledge, in the form of attribution maps, from a larger model to a smaller one and quantifies the resulting impact on performance. We evaluate eight explainability methods, using the Inseq library to extract attribution scores linking input and output sequences. This information is then injected into the attention mechanism of an encoder-decoder transformer for machine translation. Our results show that this framework serves both as an automatic evaluation tool for explainability techniques and as a knowledge distillation strategy that enhances model performance. Our experiments demonstrate that the Attention and Value Zeroing attribution methods consistently improve results across three machine translation tasks and four composition settings. The code will be available on GitHub.
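To make the attribution-extraction step concrete, the sketch below shows how Inseq can produce source-to-target attribution maps for a machine translation model. This is not the authors' released code: the model checkpoint, the example sentence, and the way the resulting matrix is inspected are illustrative assumptions; only the library calls (`inseq.load_model`, `model.attribute`) and the "attention" / "value_zeroing" method names are from Inseq itself.

```python
# Minimal sketch, assuming an off-the-shelf MT checkpoint (not the paper's models).
import inseq

# Load an encoder-decoder MT model together with one of the evaluated
# attribution methods ("attention" here; "value_zeroing" is another option).
model = inseq.load_model("Helsinki-NLP/opus-mt-en-de", "attention")

# Attribute a translation: Inseq generates the output and records, for each
# generated target token, attribution scores over the source tokens.
out = model.attribute("The weather is nice today.")

# One attribution object per input sequence; source_attributions holds the
# source-to-target score tensor that would be injected into the smaller model.
seq_attr = out.sequence_attributions[0]
print(seq_attr.source_attributions.shape)
```

Such a matrix (one row per source token, one column per generated token, possibly with extra layer/head dimensions before aggregation) is the kind of attribution map the paper injects into the student model's attention mechanism.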
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: Explainable AI, Machine Translation
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English, German, French, Arabic
Submission Number: 7912