On the Convergence of Intrinsic Self-Correction in Large Language Models: Latent Concept and Model Uncertainty

ACL ARR 2025 February Submission 1559 Authors

14 Feb 2025 (modified: 09 May 2025), ACL ARR 2025 February Submission, CC BY 4.0

Abstract: Large Language Models (LLMs) can improve their responses when instructed to do so, a capability known as self-correction. When the instructions state only a general, abstract goal and give no specific details about potential issues in the response, LLMs must rely on their internal knowledge to improve response quality, a process referred to as intrinsic self-correction. The empirical success of intrinsic self-correction is evident in various applications, but how and why it is effective remains unknown. In this paper, we reveal a key characteristic of intrinsic self-correction, namely that performance converges over multi-round interactions, and provide a mechanistic analysis of this convergence behavior. Our findings are threefold: (1) intrinsic self-correction progressively yields performance gains through iterative interactions, ultimately converging to stable performance across various tasks; (2) a mechanistic analysis of intrinsic self-correction for enhanced morality provides empirical evidence that iteratively applying instructions reduces model uncertainty, which in turn leads the calibration error to converge and ultimately results in convergent self-correction performance; (3) a mathematical simulation indicates that the latent concepts activated by self-correction instructions drive the reduction of model uncertainty. Based on our experimental results and our analysis of this convergence, we identify its underlying mechanism: consistently injected moral instructions reduce model uncertainty, which leads to improved calibration error and, ultimately, convergent self-correction performance.

Paper Type: Long

Research Area: Machine Learning for NLP

Research Area Keywords: self-correction, large language models, morality, toxicity, social bias

Contribution Types: Model analysis & interpretability

Languages Studied: English

Submission Number: 1559
