Keywords: probing, misinformation, neurons
Abstract: Regret in large language models (LLMs) refers to their explicit expression of regret when confronted with evidence that contradicts previously generated misinformation. Understanding how regret is neurally encoded, and the mechanisms underlying it, is crucial for advancing our knowledge of artificial metacognition and improving model reliability. To study this encoding, we must first identify regret expressions in model outputs and then analyze their internal representations, which requires examining the model's hidden states, where information is processed at the neuronal level. This endeavor faces three key challenges: (1) the absence of specialized datasets capturing regret expressions, (2) the lack of metrics for identifying the layers that best represent regret, and (3) the absence of methods for identifying and analyzing regret-related neurons. To address these challenges, we propose: (1) a workflow for constructing a comprehensive regret dataset through strategically designed prompting scenarios, (2) the Supervised Compression-Decoupling Index (S-CDI) for identifying the optimal layers for regret representation, and (3) the Regret Dominance Score (RDS) for identifying regret-related neurons, together with the Group Impact Coefficient (GIC) for analyzing their activation patterns. Using these metrics, we uncover an oscillatory decoupling pattern in S-CDI across layers and a combinatorial encoding mechanism involving regret neurons, non-regret neurons, and dual-function neurons. Building on these findings, we develop an intervention framework to validate our account of regret encoding: guided by the S-CDI curve, we select combinatorially encoded regret neurons in the optimal layers as anchors, apply gradient-based attribution to identify related cross-layer neurons, and perform controlled interventions to verify our mechanistic understanding. This work provides neuron-level insights into artificial metacognition and offers methodological tools for analyzing complex cognitive states in LLMs, advancing our understanding of how such mechanisms emerge.
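The abstract describes locating regret representations in the model's hidden states on a per-layer basis. The sketch below illustrates the general idea with a simple layer-wise linear probe over last-token hidden states; it is a minimal illustration only, not the paper's S-CDI, RDS, or GIC metrics. The model name ("gpt2"), the example sentences, and the use of cross-validated probe accuracy as a layer-quality proxy are all assumptions made for demonstration.

```python
# Minimal sketch (assumed setup, not the paper's method): probe each layer's
# hidden states for linear separability of regret vs. non-regret responses.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

MODEL_NAME = "gpt2"  # hypothetical stand-in for the studied LLM
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

def last_token_states(texts):
    """Return, for each layer, a list of final-token hidden-state vectors."""
    per_layer = None
    for text in texts:
        inputs = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            out = model(**inputs)
        # out.hidden_states: tuple of (batch, seq_len, hidden) tensors, one per layer
        states = [h[0, -1].float().numpy() for h in out.hidden_states]
        if per_layer is None:
            per_layer = [[] for _ in states]
        for i, s in enumerate(states):
            per_layer[i].append(s)
    return per_layer

# Hypothetical mini-dataset of responses expressing regret vs. not.
regret = [
    "You're right, I was mistaken earlier; I regret stating that incorrect fact.",
    "I apologize for my previous answer, which was wrong.",
    "Looking back, I regret giving you that misleading information.",
    "I'm sorry, my earlier claim does not hold up against the evidence.",
]
neutral = [
    "The capital of France is Paris.",
    "Water boils at 100 degrees Celsius at sea level.",
    "My previous answer was correct, and the evidence supports it.",
    "The report covers quarterly revenue for the last fiscal year.",
]
X_layers = last_token_states(regret + neutral)
y = [1] * len(regret) + [0] * len(neutral)

for layer, X in enumerate(X_layers):
    probe = LogisticRegression(max_iter=1000)
    # Cross-validated probe accuracy as a rough proxy for how decodable the
    # regret signal is at this layer (a stand-in, not S-CDI itself).
    acc = cross_val_score(probe, X, y, cv=2).mean()
    print(f"layer {layer:2d}  probe accuracy {acc:.2f}")
```

In practice one would use a much larger labeled dataset and held-out evaluation; the point of the sketch is only to show how layer-wise hidden states can be extracted and scored before any neuron-level analysis or intervention.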
Primary Area: interpretability and explainable AI
Submission Number: 4062