TL;DR: Internal causal mechanisms robustly predict language model out-of-distribution behaviors.
Abstract: Interpretability research now offers a variety of techniques for identifying abstract internal mechanisms in neural networks. Can such techniques be used to predict how models will behave on out-of-distribution examples? In this work, we provide a positive answer to this question. Through a diverse set of language modeling tasks—including symbol manipulation, knowledge retrieval, and instruction following—we show that the most robust features for correctness prediction are those that play a distinctive causal role in the model’s behavior. Specifically, we propose two methods that leverage causal mechanisms to predict the correctness of model outputs: counterfactual simulation (checking whether key causal variables are realized) and value probing (using the values of those variables to make predictions). Both achieve high AUC-ROC in distribution and outperform methods that rely on causal-agnostic features in out-of-distribution settings, where predicting model behaviors is more crucial. Our work thus highlights a novel and significant application for internal causal analysis of language models.
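To make the two predictors concrete, here is a minimal sketch, not the authors' implementation (see the linked repository for that). It illustrates value probing on synthetic activations and outlines counterfactual simulation as an intervention check; the synthetic data, the choice of causal-variable site, and the `run_with_patch` runner are all hypothetical placeholders introduced for illustration.

```python
# Illustrative sketch of the two correctness predictors described in the abstract.
# Assumptions: activations at the hypothesized causal-variable site have already
# been extracted as vectors; labels, sites, and `run_with_patch` are placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# --- Value probing ------------------------------------------------------
# Train a probe on the value of a hypothesized causal variable (here, a toy
# hidden-state vector) to predict whether the model's output is correct.
n_train, n_test, d = 512, 256, 64
acts_train = rng.normal(size=(n_train, d))           # activations at the causal site
correct_train = (acts_train[:, 0] > 0).astype(int)   # toy correctness labels
acts_test = rng.normal(size=(n_test, d))
correct_test = (acts_test[:, 0] > 0).astype(int)

probe = LogisticRegression(max_iter=1000).fit(acts_train, correct_train)
auc = roc_auc_score(correct_test, probe.predict_proba(acts_test)[:, 1])
print(f"value-probing AUC-ROC: {auc:.3f}")

# --- Counterfactual simulation ------------------------------------------
# Predict correctness by checking whether the hypothesized causal variable is
# realized: patch its value from a counterfactual run into the base run and
# test whether the output changes as the high-level causal model predicts.
def counterfactual_simulation(base_input, counterfactual_input,
                              run_with_patch, expected_output):
    """Return 1 if the intervention yields the output the high-level causal
    model predicts (evidence that the causal variable is realized)."""
    patched_output = run_with_patch(base_input, counterfactual_input)
    return int(patched_output == expected_output)
```

In this toy setup the probe recovers the planted correctness signal almost perfectly; in practice, activations would be read from a real language model at the site identified by causal analysis, and `run_with_patch` would perform an activation-patching forward pass.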
Lay Summary: Large language models are powerful but often unpredictable, especially when they face unfamiliar inputs. A big question in interpretability research is: can we peek inside these models to understand why they behave the way they do—and use that understanding to predict how they will act on unseen inputs?
Our research suggests the answer is yes. We looked at a variety of language-related tasks, such as following instructions or retrieving facts, and found that certain internal mechanisms that have causal effects on model behaviors are particularly useful. We tested two ways of using these internal mechanisms to predict whether the model's answers would be correct. One checks whether key components of the causal mechanism are triggered (counterfactual simulation), and the other uses the values of these components to make predictions (value probing). Both approaches worked well, especially when the model was processing unfamiliar data.
This shows that understanding a model’s internal workings can help us predict its behavior more reliably—an important step toward safer, more trustworthy AI.
Link To Code: https://github.com/explanare/ood-prediction
Primary Area: Social Aspects->Accountability, Transparency, and Interpretability
Keywords: Causal Abstraction, Causal Interpretability, OOD, Correctness Prediction
Submission Number: 15536