An Attribution Method for Siamese Encoders

Published: 07 Oct 2023, Last Modified: 01 Dec 2023 · EMNLP 2023 Main
Submission Type: Regular Short Paper
Submission Track: Interpretability, Interactivity, and Analysis of Models for NLP
Submission Track 2: Machine Learning for NLP
Keywords: feature attribution, interpretability, explainability, siamese encoder, sentence transformer, integrated gradients, integrated Jacobians
TL;DR: This paper derives a local attribution method for siamese encoders by generalizing the principle of integrated gradients to models receiving multiple inputs.
Abstract: Despite the success of Siamese encoder models such as sentence transformers (STs), little is known about the aspects of inputs they pay attention to. A barrier is that their predictions cannot be attributed to individual features, as they compare two inputs rather than processing a single one. This paper derives a local attribution method for Siamese encoders by generalizing the principle of integrated gradients to models with multiple inputs. The output takes the form of feature-pair attributions, and in the case of STs it can be reduced to a token--token matrix. Our method involves the introduction of integrated Jacobians and inherits the advantageous formal properties of integrated gradients: it accounts for the model's full computation graph and is guaranteed to converge to the actual prediction. A pilot study shows that in the case of STs a few token pairs can dominate predictions and that STs preferentially focus on nouns and verbs. For accurate predictions, however, they need to attend to the majority of tokens and parts of speech.
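The abstract's core idea can be illustrated on a toy model. The sketch below is not the authors' implementation: it uses a hypothetical one-layer "encoder" (mean-pool then tanh) with an analytic Jacobian, approximates the integrated Jacobian of each input by a Riemann sum along a straight path from a zero reference, and forms the token--token attribution matrix whose entries sum (up to numerical error) to the dot-product prediction relative to the references.

```python
import numpy as np

rng = np.random.default_rng(0)
d, h, na, nb = 4, 3, 5, 6          # token dim, embedding dim, sentence lengths
W = rng.normal(size=(h, d))

def encode(X):
    # toy stand-in for a sentence encoder: mean-pool tokens, then a tanh layer
    return np.tanh(W @ X.mean(axis=0))

def jacobian(X):
    # analytic Jacobian of encode w.r.t. each token, shape (h, n_tokens, d)
    g = 1.0 - np.tanh(W @ X.mean(axis=0)) ** 2      # tanh'
    n = X.shape[0]
    J = (g[:, None] * W)[:, None, :] / n            # identical for every token
    return np.broadcast_to(J, (h, n, X.shape[1]))

def integrated_jacobian(X, R, steps=1000):
    # midpoint Riemann sum of the Jacobian along the straight path R -> X
    acc = np.zeros((h, X.shape[0], X.shape[1]))
    for t in (np.arange(steps) + 0.5) / steps:
        acc += jacobian(R + t * (X - R))
    return acc / steps

A_tok = rng.normal(size=(na, d))
B_tok = rng.normal(size=(nb, d))
Ra, Rb = np.zeros_like(A_tok), np.zeros_like(B_tok)  # zero references

Ja = integrated_jacobian(A_tok, Ra)                  # (h, na, d)
Jb = integrated_jacobian(B_tok, Rb)                  # (h, nb, d)

# integrated contribution of each token to each embedding dimension
Ea = np.einsum('hid,id->hi', Ja, A_tok - Ra)
Eb = np.einsum('hjd,jd->hj', Jb, B_tok - Rb)

# token-token attribution matrix: M[i, j] pairs token i of a with token j of b
M = np.einsum('hi,hj->ij', Ea, Eb)

# completeness: attributions sum to the prediction, corrected by reference terms
score = (encode(A_tok) @ encode(B_tok)
         - encode(A_tok) @ encode(Rb)
         - encode(Ra) @ encode(B_tok)
         + encode(Ra) @ encode(Rb))
print(M.shape, abs(M.sum() - score))
```

With enough integration steps, the sum of the matrix converges to the (reference-corrected) similarity score, mirroring the convergence guarantee stated in the abstract; all model details here (the tanh layer, zero references, step count) are illustrative assumptions.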
Submission Number: 3838