Keywords: Neuron Localization, LLM Interpretability
Abstract: Despite their remarkable capabilities, the complex mechanisms by which neurons influence Large Language Models (LLMs) remain opaque, posing significant challenges for understanding and steering LLMs. While recent studies have made progress on identifying neurons responsible for certain abilities, these ability-specific methods are infeasible for task-focused scenarios requiring the coordinated use of multiple abilities. Moreover, these approaches focus only on supportive neurons that account for target performance while neglecting neurons with other roles, e.g., inhibitive roles, resulting in an incomplete view of how LLMs execute tasks. They are also often customized for specific data structures, lacking flexibility for diverse tasks with varying input-output formats. To address these challenges, we propose NeuronLLM, a novel task-level LLM understanding framework that adopts the biological principle of functional antagonism for LLM neuron identification, with the key insight that task performance is jointly determined by neurons with two opposing roles: "good" neurons that facilitate task completion and "bad" neurons that inhibit it. NeuronLLM is instantiated by two main modules: Question-Answering-based Task Transformation (QATT) and Contrastive Neuron Identification (CNI). QATT transforms diverse tasks into a unified question-answering format, enabling NeuronLLM to understand LLMs under different tasks; CNI identifies good and bad neurons via a new cross-entropy-based contrastive scoring method, offering a holistic view of neuron analysis. Comprehensive experiments on LLMs of different sizes and families show that NeuronLLM substantially outperforms existing methods in identifying task-relevant neurons across four NLP tasks, providing new insights into LLM functional organization.
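The contrastive scoring idea described above can be illustrated with a minimal sketch. The paper's exact CNI formulation is not reproduced here; this toy assumes one hypothetical scoring rule, namely scoring each neuron by the change in cross-entropy loss on the correct answer when that neuron is ablated. Under this assumption, a positive score (ablation hurts) marks a "good" neuron and a negative score (ablation helps) marks a "bad" one. The model, `forward` function, and weight matrix `W` are all hypothetical.

```python
import numpy as np

def cross_entropy(logits, target):
    # Softmax cross-entropy for a single target token.
    z = logits - logits.max()
    return -(z[target] - np.log(np.exp(z).sum()))

def contrastive_neuron_scores(forward, neurons, target):
    """Illustrative contrastive scoring (not the paper's exact method):
    score[i] = CE(model with neuron i ablated) - CE(full model).
    Positive -> ablation hurts -> supportive ("good") neuron;
    negative -> ablation helps -> inhibitive ("bad") neuron."""
    base = cross_entropy(forward(mask=None), target)
    return {i: cross_entropy(forward(mask=i), target) - base for i in neurons}

# Toy "model": two neurons contribute linearly to 3-way logits.
# Neuron 0 supports the target class; neuron 1 works against it.
W = np.array([[ 2.0, 0.0, 0.0],   # neuron 0: supportive
              [-1.5, 1.0, 1.0]])  # neuron 1: inhibitive

def forward(mask=None):
    acts = np.ones(2)
    if mask is not None:
        acts[mask] = 0.0          # ablate one neuron
    return acts @ W

scores = contrastive_neuron_scores(forward, [0, 1], target=0)
good = [i for i, s in scores.items() if s > 0]
bad = [i for i, s in scores.items() if s < 0]
```

In this toy setup, ablating neuron 0 raises the loss (so it is scored "good") while ablating neuron 1 lowers it (scored "bad"), matching the functional-antagonism intuition.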
Supplementary Material: zip
Primary Area: interpretability and explainable AI
Submission Number: 11708