Neuron-Level Linguistic Selectivity in LLMs via a Classifier-Free Framework

Published: 23 Sept 2025, Last Modified: 20 Nov 2025 · UniReps 2025 Oral · CC BY 4.0
Track: Extended Abstract Track
Keywords: Interpretability, Large Language Models, Minimal Pairs, Neuroscience-Inspired
TL;DR: We introduce a probe-free method using minimal pairs to map neuron-level linguistic selectivity in LLMs, revealing domain-specific and domain-general functional organization.
Abstract: Understanding how Large Language Models (LLMs) encode linguistic structures remains a fundamental challenge in interpretability research. While diagnostic classifiers (or "probes") are the standard tool for this task, they face significant methodological criticism: training auxiliary classifiers introduces capacity confounds and calibration issues, often making it difficult to distinguish the model's intrinsic representations from the probe's ability to learn the task. To address these limitations, we introduce a probe-free framework for localizing linguistic selectivity at the individual neuron level. Leveraging the controlled contrasts of linguistic minimal pairs, we propose Minimal-Pair Neuron Separability (MPNS), a metric that directly quantifies how reliably single neurons differentiate grammatical from ungrammatical constructions, without any parameter updates. By applying this framework to the Qwen3 model, we uncover a distinct functional hierarchy: syntactic and morphological processing is concentrated in early-to-mid layers, whereas semantic-syntactic interfaces and conceptual reasoning emerge in deeper layers. Furthermore, hierarchical clustering of sensitive neurons reveals a modular internal organization, identifying both domain-specific "specialists" and domain-general "integrators". Our approach yields fine-grained, interpretable maps of linguistic competence, offering a rigorous alternative to probing for mechanistic analysis.
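The abstract does not give the exact formula for MPNS, but the idea it describes — scoring how reliably a single neuron's activation separates the grammatical and ungrammatical members of minimal pairs, with no trained probe — can be sketched as a paired effect-size statistic. The sketch below is a hypothetical formulation (an assumption, not the paper's definition): for each neuron, take the paired activation difference across minimal pairs and normalize its mean by its standard deviation, d'-style.

```python
import numpy as np

def mpns(acts_gram, acts_ungram):
    """Hypothetical Minimal-Pair Neuron Separability score.

    acts_gram, acts_ungram: arrays of shape (n_pairs, n_neurons), holding one
    activation per sentence for each neuron (e.g. at the final token).
    Returns one score per neuron: the mean paired (grammatical - ungrammatical)
    activation gap, normalized by the gap's standard deviation across pairs.
    This is an illustrative d'-style statistic; the paper's exact metric may differ.
    """
    diffs = acts_gram - acts_ungram          # paired contrast per minimal pair
    mu = diffs.mean(axis=0)                  # mean activation gap per neuron
    sigma = diffs.std(axis=0) + 1e-8         # variability of the gap across pairs
    return np.abs(mu) / sigma                # high score = reliable separation

# Toy demo with simulated activations: neuron 0 shifts consistently between
# the two pair conditions, neuron 1 differs only by noise.
rng = np.random.default_rng(0)
n_pairs = 200
gram = rng.normal(0.0, 1.0, size=(n_pairs, 2))
ungram = gram.copy()
ungram[:, 0] -= 2.0                                # neuron 0: consistent shift
ungram[:, 1] += rng.normal(0.0, 1.0, n_pairs)      # neuron 1: unstructured noise

scores = mpns(gram, ungram)                        # scores[0] >> scores[1]
```

Because the statistic is computed directly from cached activations, it needs no auxiliary classifier and hence no parameter updates, which is the property the abstract emphasizes over probing.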
Submission Number: 71