Keywords: circuits, mechanistic interpretability, language models, extractive QA
TL;DR: We mechanistically investigate extractive QA tasks and find that the circuit components can be used for reliable data attribution
Abstract: Recent studies have extracted circuits from the computational graphs of language models for simple tasks such as entity tracking or indirect object identification. In this paper, we scale circuit extraction up to a real-world language modeling task: context-augmented language modeling for question answering (QA), and study how circuits can benefit downstream applications such as data attribution. We extract circuits over internal model components (e.g., attention heads, attention layers, MLPs) using causal mediation analysis. Leveraging the extracted circuits, we first characterize the interplay between the language model's use of parametric memory and retrieved context, yielding a better mechanistic understanding of context-augmented language models. We then identify a small set of attention heads in our circuit that perform reliable data attribution by default, so attribution is obtained for free within the model's forward pass. Building on this insight, we introduce AttnAttrib, a fast data attribution algorithm. Through empirical experiments across different extractive QA benchmarks, we show that AttnAttrib achieves state-of-the-art attribution results across different language models. Finally, we show that the language model can be steered to answer from the context, rather than its parametric memory, by (i) using the attribution from our extracted attention head as an additional signal during the forward pass and (ii) scaling the outputs of a small set of attention heads. Beyond mechanistic understanding, our paper provides tangible applications of mechanistic circuits in the form of reliable data attribution and model steering.
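To illustrate the idea that a single attention head can supply data attribution within one forward pass, below is a minimal sketch in Python using Hugging Face Transformers. This is not the paper's AttnAttrib implementation: the model, the prompt, and in particular the choice of layer and head index are hypothetical placeholders (the paper identifies such heads via causal mediation analysis); the sketch only shows how attention weights from a chosen head can be read out as scores over context tokens.

```python
# Minimal sketch (assumed setup, not the paper's AttnAttrib): use one head's
# attention weights as attribution over the context in a single forward pass.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model; the paper evaluates other LMs
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_attentions=True)
model.eval()

context = "The Eiffel Tower is located in Paris, the capital of France."
question = " Question: Where is the Eiffel Tower located? Answer:"
inputs = tokenizer(context + question, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Hypothetical choice of an "attribution" head (layer 9, head 6); the paper
# selects such heads from the extracted circuit.
layer_idx, head_idx = 9, 6
attn = outputs.attentions[layer_idx][0, head_idx]  # shape: (seq_len, seq_len)

# Attribution over the context = attention mass the final (answer-predicting)
# position places on each context token.
num_context_tokens = len(tokenizer(context)["input_ids"])
scores = attn[-1, :num_context_tokens]

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0][:num_context_tokens])
for tok, score in sorted(zip(tokens, scores.tolist()), key=lambda x: -x[1])[:5]:
    print(f"{tok:>12s}  {score:.3f}")
```

In this sketch the top-scoring context tokens serve as the attribution for the generated answer; no gradients or additional passes are required, which is what makes forward-pass attribution cheap.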
Primary Area: interpretability and explainable AI
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 4614