Keywords: attribution, generative models, large language models
TL;DR: We learn to pinpoint the in-context information that a language model uses when generating content, using attention weights as features.
Abstract: Given a sequence of tokens generated by a language model, we may want to identify the preceding tokens that *influence* the model to generate this sequence. Performing such *token attribution* is expensive: a common approach is to ablate preceding tokens and directly measure their effects, which requires rerunning the model many times. To reduce the cost of token attribution, we revisit attention weights as a heuristic for how a language model uses previous tokens. Naive approaches to attributing model behavior with attention (e.g., averaging attention weights across attention heads to estimate a token's influence) have been found to be unreliable. To attain faithful attributions, we propose treating the attention weights of different attention heads as *features*. This way, we can *learn* how to effectively leverage attention weights for attribution (using signal from ablations). Our resulting method, Attribution with Attention (AT2), reliably performs on par with approaches that involve many ablations, while being significantly more efficient.
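The core idea lends itself to a brief illustration. The sketch below is not the paper's implementation: it uses synthetic NumPy arrays in place of real per-head attention weights and ablation-derived influence scores, and a simple least-squares fit stands in for whatever learning procedure AT2 actually uses. It only shows the shape of the approach: treat each head's attention weight as a feature of a (generated token, preceding token) pair, learn how to combine heads from ablation signal, then attribute new generations without further ablations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (hypothetical): per-head attention from a generated token
# to each preceding token, plus ablation-derived influence scores.
num_heads, num_source_tokens, num_examples = 8, 20, 50

def toy_example():
    # A: attention weights, shape (num_heads, num_source_tokens)
    # y: ablation-derived influence of each preceding token, shape (num_source_tokens,)
    A = rng.random((num_heads, num_source_tokens))
    hidden_head_weights = rng.normal(size=num_heads)      # pretend some heads matter
    y = hidden_head_weights @ A + 0.05 * rng.normal(size=num_source_tokens)
    return A, y

train = [toy_example() for _ in range(num_examples)]

# Learning step (stand-in): treat each head's attention weight as a feature
# and fit a linear combination of heads against the ablation-based scores.
X = np.concatenate([A.T for A, _ in train])   # (num_examples * num_source_tokens, num_heads)
t = np.concatenate([y for _, y in train])     # matching influence targets
learned_head_weights, *_ = np.linalg.lstsq(X, t, rcond=None)

# Attribution time: no ablations needed; score preceding tokens by the
# learned combination of their per-head attention weights.
A_new, y_new = toy_example()
scores = learned_head_weights @ A_new
print("top attributed tokens:", np.argsort(scores)[::-1][:5])
print("correlation with ablation scores:", np.corrcoef(scores, y_new)[0, 1])
```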
Primary Area: interpretability and explainable AI
Submission Number: 23464