Release Opt Out: No, I don't wish to opt out of paper release. My paper should be released.
Keywords: Mechanistic Interpretability, Uncertainty Quantification, Large Language Models, AI Safety, Technical AI Governance
Abstract: We investigate the mechanistic sources of uncertainty in large language models (LLMs), a question with important implications for model reliability and trustworthiness. We conduct a series of experiments to determine whether the factuality of generated responses and a model's uncertainty originate in separate or shared circuits within the model architecture. To do so, we adapt the well-established mechanistic interpretability techniques of causal tracing and two styles of zero-ablation to study the effect of different circuits on LLM generations. Our experiments on eight models and five datasets, representing tasks that predominantly require factual recall, provide strong evidence that a model's uncertainty is produced in the same parts of the network that are responsible for the factuality of generated responses.
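To make the zero-ablation idea mentioned in the abstract concrete, below is a minimal, illustrative sketch (not the paper's implementation) of the general technique: a forward hook replaces one sub-module's output with zeros and the change in the model's output is measured. The toy model, module names, and the distance metric are assumptions for illustration only.

```python
# Illustrative zero-ablation sketch (hypothetical toy model, not the paper's code):
# replace one sub-layer's output with zeros via a forward hook and compare the
# model's output before and after the intervention.
import torch
import torch.nn as nn

class ToyBlock(nn.Module):
    """Stand-in for a transformer block with an MLP sub-layer."""
    def __init__(self, d_model: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        return self.norm(x + self.mlp(x))

model = nn.Sequential(*[ToyBlock() for _ in range(4)])
x = torch.randn(1, 8, 16)  # (batch, sequence, d_model)

with torch.no_grad():
    clean_out = model(x)

# Zero-ablate the MLP in block 2: returning a value from the hook replaces the output.
def zero_ablate(module, inputs, output):
    return torch.zeros_like(output)

handle = model[2].mlp.register_forward_hook(zero_ablate)
with torch.no_grad():
    ablated_out = model(x)
handle.remove()

# A large change suggests the ablated component contributes strongly to the output.
print("effect of ablation:", (clean_out - ablated_out).norm().item())
```

In practice the same hook mechanism can be attached to attention heads or MLP layers of a pretrained LLM, and the effect measured on task-specific quantities such as answer correctness or predictive uncertainty.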
Submission Number: 14