Abstract: The importance of input features in the decision-making of neural network models is a well-explored area of machine learning research. Numerous approaches have been developed to estimate and explain the behavior of these models.
Among the models that rely on neural networks, the sequence-to-sequence (seq2seq) architecture is particularly complex. Although general techniques can be applied to these models, the evaluation of explainability methods in this context remains underexplored.
In this paper, we propose a novel approach, based on forward simulatability, to automatically evaluate explainability methods for transformer-based seq2seq models. The idea is to inject the knowledge learned by a large model into a smaller one and measure the change in the smaller model's results. We experiment with eight explainability methods, using inseq to extract attribution scores of input tokens with respect to the output sequence. We then inject this information into the attention mechanism of an encoder-decoder transformer model for machine translation. Our results demonstrate that this framework can serve both as an automatic evaluation method for explainability techniques and as a knowledge distillation process that improves performance.
According to our experiments, the attention attribution and value zeroing methods consistently improved performance across three machine translation tasks and different composition operators.
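As a rough illustration of the attribution-extraction step, the sketch below shows how inseq can be used to obtain source-to-target attribution scores from a translation model with the attention method. The checkpoint name is a hypothetical choice for illustration, and the subsequent step of injecting the scores into a smaller model's attention is not shown.

```python
# Minimal sketch (not the authors' code): extract attention-based attribution
# scores with inseq for an encoder-decoder machine translation model.
import inseq

# Hypothetical example checkpoint; any seq2seq translation model would do.
model = inseq.load_model("Helsinki-NLP/opus-mt-en-de", "attention")

# Attribute a source sentence: yields per-token scores linking source tokens
# to each generated target token.
out = model.attribute("The weather is nice today.")
out.show()  # visualize the source-to-target attribution matrix
```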
Paper Type: Long
Research Area: Machine Translation
Research Area Keywords: Explainable AI, Machine Translation
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English, German, French, Arabic
Submission Number: 8333