Abstract: The attribution of input features to the outputs of neural network models is an active area of research. While numerous explainability techniques have been proposed to interpret these models, their systematic and automated evaluation in sequence-to-sequence models remains underexplored.
This paper introduces a novel approach for evaluating explainability methods in transformer-based seq2seq models, building on forward simulation of XAI methods. Our method transfers learned knowledge, in the form of attribution maps, from a larger model to a smaller one and quantifies the resulting impact on performance. We evaluate eight explainability methods, using the Inseq library to extract attribution scores linking input and output sequences. This information is then injected into the attention mechanism of an encoder-decoder transformer for machine translation. Our results show that this framework serves both as an automatic evaluation tool for explainability techniques and as a knowledge distillation strategy that enhances model performance. Our experiments demonstrate that the Attention and Value Zeroing attribution methods consistently improve results across three machine translation tasks and four composition settings. The code will be available on GitHub.
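To make the attribution-extraction step concrete, the sketch below shows how Inseq can produce source-to-target attribution maps for a machine translation model. This is not the authors' released code: the model checkpoint, the example sentence, and the way the resulting matrix is inspected are illustrative assumptions; only the library calls (`inseq.load_model`, `model.attribute`) and the "attention" / "value_zeroing" method names are from Inseq itself.

```python
# Minimal sketch, assuming an off-the-shelf MT checkpoint (not the paper's models).
import inseq

# Load an encoder-decoder MT model together with one of the evaluated
# attribution methods ("attention" here; "value_zeroing" is another option).
model = inseq.load_model("Helsinki-NLP/opus-mt-en-de", "attention")

# Attribute a translation: Inseq generates the output and records, for each
# generated target token, attribution scores over the source tokens.
out = model.attribute("The weather is nice today.")

# One attribution object per input sequence; source_attributions holds the
# source-to-target score tensor that would be injected into the smaller model.
seq_attr = out.sequence_attributions[0]
print(seq_attr.source_attributions.shape)
```

Such a matrix (one row per source token, one column per generated token, possibly with extra layer/head dimensions before aggregation) is the kind of attribution map the paper injects into the student model's attention mechanism.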
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: Explainable AI, Machine Translation
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English, German, French, Arabic
Submission Number: 7912