Uncovering Neuronal Mechanisms of Intrinsic Self-Debiasing in Large Language Models via Contrastive Learning

11 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Mechanistic Interpretability; Bias; Safety
Abstract: With advances in alignment techniques, large language models (LLMs) have demonstrated an intrinsic self-debiasing capability against stereotypes. However, our understanding of the underlying mechanism remains limited, which significantly hinders the development of trustworthy AI. In the field of LLM safety, prior studies have shown that defense against explicitly harmful queries is governed by a sparse set of critical neurons. These neurons typically exhibit a strong activation response when processing malicious inputs, a phenomenon known as $\textit{explicit induction}$. Nevertheless, this paradigm fails to capture implicit hazards, particularly stereotypical biases, which operate via $\textit{implicit association}$: shifts in neuronal response patterns across different social contexts rather than mere activation strength. Based on this insight, we propose $\textbf{\textit{COCO}}$, a $\textit{contrastive learning-based}$ method that identifies self-debiasing neurons exhibiting $\textit{intra-$\underline{co}$nsistency and inter-$\underline{co}$ntrast}$ (termed $\textbf{\textit{COCO Neurons}}$). Our findings reveal that COCO neurons account for approximately 1\% of all neurons and are primarily located in the Query and Value weight matrices of the deeper network layers. To leverage COCO neurons effectively, we draw inspiration from neurodynamics and decompose the intrinsic self-debiasing capability of LLMs into two distinct systems, a linear debiasing system and a nonlinear debiasing system, for which we design tailored neuron-enhancement editing strategies, $\textbf{\textit{LE-COCO}}$ and $\textbf{\textit{NE-COCO}}$. Experimental results across six social categories demonstrate that the success rate of Llama3-8B in resisting stereotypical biases rises to nearly 90\% after linear enhancement, with a maximum gain of over 50\%.
Meanwhile, Mistral-7B with nonlinear enhancement achieves an average gain of 10\% in its success rate of resisting stereotypical biases, with a maximum gain of 23\%. Furthermore, generalization experiments reveal that the enhanced models exhibit not only stronger robustness against jailbreak attacks but also measurable improvements on factual and reasoning benchmarks.
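The abstract's core selection criterion, neurons whose responses are stable within one social context but shift across contrasting contexts, can be illustrated with a minimal sketch. The paper does not publish its scoring formula here, so everything below is an assumption: `coco_scores` is a hypothetical stand-in that combines intra-group consistency (low within-group activation variance) with inter-group contrast (shift in mean activation between two contexts), and the 1\% selection fraction mirrors the sparsity figure reported in the abstract.

```python
import numpy as np


def coco_scores(acts_a, acts_b, eps=1e-8):
    """Hypothetical neuron score: intra-consistency times inter-contrast.

    acts_a, acts_b: (num_prompts, num_neurons) activation matrices collected
    on two contrasting social contexts (e.g. stereotype-consistent vs.
    stereotype-violating prompts). Not the paper's actual formula.
    """
    mean_a, mean_b = acts_a.mean(axis=0), acts_b.mean(axis=0)
    # intra-consistency: neurons that respond stably within each context
    consistency = 1.0 / (acts_a.std(axis=0) + acts_b.std(axis=0) + eps)
    # inter-contrast: neurons whose mean response shifts across contexts
    contrast = np.abs(mean_a - mean_b)
    return consistency * contrast


def top_coco_neurons(scores, frac=0.01):
    """Keep the top `frac` of neurons by score (~1% in the abstract)."""
    k = max(1, int(len(scores) * frac))
    return np.argsort(scores)[::-1][:k]


# Synthetic demo: 1000 neurons, of which the first 10 shift their
# response pattern between the two contexts.
rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, size=(32, 1000))
b = rng.normal(0.0, 1.0, size=(32, 1000))
b[:, :10] += 3.0  # context-dependent shift for ten neurons

scores = coco_scores(a, b)
idx = top_coco_neurons(scores, frac=0.01)
print(sorted(idx.tolist()))
```

On this synthetic data the top 1\% of neurons by score recovers exactly the ten planted context-sensitive neurons, which is the behavior the contrastive criterion is meant to capture.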
Supplementary Material: pdf
Primary Area: interpretability and explainable AI
Submission Number: 3863