Keywords: Neuron Localization, LLM Interpretability
Abstract: Despite their remarkable capabilities, the complex mechanisms by which neurons influence Large Language Models (LLMs) remain opaque, posing significant challenges for understanding and steering LLMs. While recent studies have made progress on identifying neurons responsible for certain abilities, these ability-specific methods are infeasible for task-focused scenarios requiring the coordinated use of multiple abilities. Moreover, these approaches focus only on supportive neurons that account for target performance while neglecting neurons with other roles, e.g., inhibitive roles, resulting in an incomplete view of how LLMs execute tasks. They are also often customized for specific data structures, lacking flexibility for diverse tasks with varying input-output formats. To address these challenges, we propose NeuronLLM, a novel task-level LLM understanding framework that adopts the biological principle of functional antagonism for LLM neuron identification, with the key insight that task performance is jointly determined by neurons with two opposing roles: "good" neurons that facilitate task completion and "bad" neurons that inhibit it. NeuronLLM is instantiated by two main modules: Question-Answering-based Task Transformation (QATT) and Contrastive Neuron Identification (CNI). QATT transforms diverse tasks into a unified question-answering format, enabling NeuronLLM to understand LLMs under different tasks; CNI identifies good and bad neurons via a new cross-entropy-based contrastive scoring method, offering a holistic view of neuron analysis. Comprehensive experiments on LLMs of different sizes and families show that NeuronLLM substantially outperforms existing methods in identifying task-relevant neurons across four NLP tasks, providing new insights into LLM functional organization.
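The contrastive scoring idea described above can be illustrated with a minimal sketch. The paper's exact CNI formulation is not reproduced here; this toy assumes one hypothetical scoring rule, namely scoring each neuron by the change in cross-entropy loss on the correct answer when that neuron is ablated. Under this assumption, a positive score (ablation hurts) marks a "good" neuron and a negative score (ablation helps) marks a "bad" one. The model, `forward` function, and weight matrix `W` are all hypothetical.

```python
import numpy as np

def cross_entropy(logits, target):
    # Softmax cross-entropy for a single target token.
    z = logits - logits.max()
    return -(z[target] - np.log(np.exp(z).sum()))

def contrastive_neuron_scores(forward, neurons, target):
    """Illustrative contrastive scoring (not the paper's exact method):
    score[i] = CE(model with neuron i ablated) - CE(full model).
    Positive -> ablation hurts -> supportive ("good") neuron;
    negative -> ablation helps -> inhibitive ("bad") neuron."""
    base = cross_entropy(forward(mask=None), target)
    return {i: cross_entropy(forward(mask=i), target) - base for i in neurons}

# Toy "model": two neurons contribute linearly to 3-way logits.
# Neuron 0 supports the target class; neuron 1 works against it.
W = np.array([[ 2.0, 0.0, 0.0],   # neuron 0: supportive
              [-1.5, 1.0, 1.0]])  # neuron 1: inhibitive

def forward(mask=None):
    acts = np.ones(2)
    if mask is not None:
        acts[mask] = 0.0          # ablate one neuron
    return acts @ W

scores = contrastive_neuron_scores(forward, [0, 1], target=0)
good = [i for i, s in scores.items() if s > 0]
bad = [i for i, s in scores.items() if s < 0]
```

In this toy setup, ablating neuron 0 raises the loss (so it is scored "good") while ablating neuron 1 lowers it (scored "bad"), matching the functional-antagonism intuition.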
Supplementary Material: zip
Primary Area: interpretability and explainable AI
Submission Number: 11708