Keywords: Mechanistic Interpretability; Large Language Models
Abstract: Unintended code-switching, the phenomenon in which an LLM unexpectedly switches languages mid-generation, poses a fundamental challenge to the multilingual capabilities of LLMs.
However, the fundamental properties of the circuits underlying this behavior, such as what they consist of, where they emerge in the network, and how their effects can be mitigated, remain unexplored.
Existing works on mechanistic interpretability depend on additional training (e.g., sparse autoencoders) or manual annotation, both of which limit their applicability in real-world scenarios.
In this work, we introduce a scalable circuit discovery framework that causally localizes multilingual neurons, describes their functional patterns, and groups them into circuits.
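The abstract does not spell out the localization procedure. As one plausible instantiation, the minimal sketch below causally scores individual MLP neurons by zero-ablating them and measuring the shift in next-token logits; the model, layer, prompt, and effect metric are all illustrative assumptions, not the paper's exact method.

```python
# Hedged sketch: causal localization of MLP neurons by zero-ablation.
# Model, layer, prompt, and metric are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumption: any HuggingFace GPT-2-style causal LM
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

prompt = "Die Hauptstadt von Frankreich ist"  # German prompt; watch the next token
inputs = tok(prompt, return_tensors="pt")

def ablate_neuron(layer_idx: int, neuron_idx: int):
    """Register a forward hook that zeroes one MLP neuron's activation."""
    def hook(module, inp, out):
        out[..., neuron_idx] = 0.0  # zero-ablate the candidate neuron
        return out
    return model.transformer.h[layer_idx].mlp.act.register_forward_hook(hook)

with torch.no_grad():
    base_logits = model(**inputs).logits[0, -1]  # clean next-token logits

effects = {}
layer, candidate_neurons = 6, range(0, 3072, 512)  # assumed layer / neuron sample
for n in candidate_neurons:
    handle = ablate_neuron(layer, n)
    with torch.no_grad():
        abl_logits = model(**inputs).logits[0, -1]
    handle.remove()
    # Causal effect: largest logit shift induced by removing this neuron.
    effects[n] = (base_logits - abl_logits).abs().max().item()

print(sorted(effects.items(), key=lambda kv: -kv[1])[:3])  # top candidate neurons
```

In practice, one would compare ablation effects across prompts in different languages to separate language-selective neurons from generic ones; the single-prompt loop above only shows the causal scoring step.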
We find that the circuits for multilingual generation fall into two distinct regimes: a language regime, which acts as a lingual key that detects language patterns, and a semantic regime, which functions as a contextual value that retrieves language-agnostic semantics.
In normal cases, these two regimes converge smoothly to produce the final prediction; in code-switching scenarios, however, the semantic regime dominates the circuit, overriding the typical language pathways and destabilizing the output.
Furthermore, we fine-tune the identified language sub-circuit ($\sim0.019$\% of all neurons), reducing the code-switching rate by $20.8$\% with minimal parameter updates, demonstrating both the practical effectiveness and the scalability of the discovered circuits. Our work serves as a preliminary exploration of multilingual generation circuits, offering actionable insights for neuron-based mechanistic interpretability.
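As a hedged illustration of how such a sparse sub-circuit could be fine-tuned in isolation, the sketch below freezes all parameters and masks gradients so that only the weights feeding a handful of assumed "language neurons" receive updates. The layer and neuron indices are hypothetical placeholders; the paper's actual sub-circuit is not enumerated in the abstract.

```python
# Hedged sketch: fine-tuning only an identified neuron sub-circuit.
# The layer/neuron indices are hypothetical placeholders.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")  # assumption: GPT-2-style MLPs
circuit = {6: [17, 42, 99]}  # assumed mapping: layer index -> language-neuron indices

for p in model.parameters():  # freeze everything by default
    p.requires_grad_(False)

for layer_idx, neurons in circuit.items():
    c_fc = model.transformer.h[layer_idx].mlp.c_fc  # Conv1D: weight (d_model, d_mlp)
    c_fc.weight.requires_grad_(True)
    c_fc.bias.requires_grad_(True)
    w_mask = torch.zeros_like(c_fc.weight)
    w_mask[:, neurons] = 1.0  # only columns feeding the selected neurons update
    b_mask = torch.zeros_like(c_fc.bias)
    b_mask[neurons] = 1.0
    # Gradient hooks zero out updates outside the sub-circuit on every backward.
    c_fc.weight.register_hook(lambda g, m=w_mask: g * m)
    c_fc.bias.register_hook(lambda g, m=b_mask: g * m)

opt = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4
)
# Assumed training step: standard LM loss on monolingual continuations, e.g.
#   loss = model(**batch, labels=batch["input_ids"]).loss
#   loss.backward(); opt.step(); opt.zero_grad()
```

Masking gradients rather than slicing out weights keeps the model architecture untouched, which matches the spirit of updating only a tiny fraction of parameters while leaving the rest of the network frozen.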
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 11933