Towards Multilingual Mechanistic Interpretability

Published: 24 Sept 2025, Last Modified: 24 Sept 2025 | NeurIPS 2025 LLM Evaluation Workshop Poster | CC BY 4.0
Keywords: Mechanistic Interpretability, Multilingual
Abstract: Multilingual language models achieve strong aggregate performance yet often behave unpredictably across languages, scripts, and cultures. We argue that mechanistic explanations for such models should satisfy a \emph{causal} standard: claims must survive causal interventions and must \emph{cross-reference} across environments that perturb surface form while preserving meaning. We formalize \emph{reference families} as predicate-preserving variants, and we introduce \emph{triangulation}, an acceptance rule requiring (i) invariance of the conditional law of a task score given the internal states of a proposed subgraph and (ii) directional stability and sufficient magnitude of interventional effects across references. To supply candidate subgraphs, we adopt \emph{automatic circuit discovery} (edge attribution patching, position-aware circuit discovery, and sparse subgraph selection), and we \emph{accept or reject} those candidates by triangulation. Our proposal situates mechanistic interpretability within the theory of causal abstraction and complements causal mediation analyses by focusing on falsifiable cross-environment invariance.
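The abstract's two acceptance criteria can be illustrated with a minimal sketch. The code below is not the authors' implementation: the function `triangulate`, its parameter names, the use of a Kolmogorov-Smirnov two-sample test as a stand-in for "invariance of the conditional law," and the thresholds `invariance_alpha` and `min_effect` are all illustrative assumptions.

```python
# Hypothetical sketch of the triangulation acceptance rule, under assumed inputs:
# per-reference task scores conditioned on the candidate subgraph's internal states,
# and per-reference estimates of the interventional effect of patching that subgraph.

import numpy as np
from scipy.stats import ks_2samp


def triangulate(
    conditional_scores: dict[str, np.ndarray],   # reference -> scores given subgraph states
    interventional_effects: dict[str, float],    # reference -> effect of intervening on the subgraph
    invariance_alpha: float = 0.05,              # p-value floor for treating conditional laws as matching
    min_effect: float = 0.1,                     # minimum |effect| deemed "sufficient magnitude"
) -> bool:
    """Accept a candidate subgraph only if (i) the conditional law of the task score
    is invariant across references and (ii) interventional effects are directionally
    stable and sufficiently large across references."""
    refs = list(conditional_scores)
    base = conditional_scores[refs[0]]

    # (i) Invariance: compare each reference's score distribution against the first;
    # failing to reject the two-sample test is taken as a match of conditional laws.
    invariant = all(
        ks_2samp(base, conditional_scores[r]).pvalue >= invariance_alpha
        for r in refs[1:]
    )

    # (ii) Directional stability and magnitude of interventional effects.
    effects = np.array([interventional_effects[r] for r in refs])
    same_direction = np.all(effects > 0) or np.all(effects < 0)
    large_enough = np.all(np.abs(effects) >= min_effect)

    return invariant and bool(same_direction and large_enough)


if __name__ == "__main__":
    # Toy usage with synthetic data standing in for a reference family of three variants.
    rng = np.random.default_rng(0)
    scores = {f"ref_{i}": rng.normal(0.8, 0.05, size=200) for i in range(3)}
    effects = {f"ref_{i}": 0.3 + 0.02 * i for i in range(3)}
    print("accept candidate subgraph:", triangulate(scores, effects))
```

In this sketch, candidate subgraphs from automatic circuit discovery would be passed through `triangulate` one at a time; the specific invariance test and effect-size threshold are design choices the paper leaves to the full method.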
Submission Number: 97