Abstract: Large Language Models (LLMs) are able to improve their responses when instructed to do so, a capability known as self-correction.
When instructions provide only a general and abstract goal without specific details about potential issues in the response, LLMs must rely on their internal knowledge to improve response quality, a process referred to as intrinsic self-correction.
The empirical success of intrinsic self-correction is evident in various applications, but how and why it is effective remains poorly understood.
Focusing on moral self-correction in LLMs, we reveal a key characteristic of intrinsic self-correction, namely performance convergence through multi-round interactions, and provide a mechanistic analysis of this convergence behavior.
Based on our experimental results and analysis, we uncover the underlying mechanism of convergence: consistently injected self-correction instructions activate moral concepts that reduce model uncertainty, leading to converged performance as the activated moral concepts stabilize over successive rounds.
This paper demonstrates the strong potential of moral self-correction by showing that it exhibits a desirable property: converged performance.
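To make the multi-round setup concrete, the following is a minimal, hedged sketch of intrinsic self-correction with a simple uncertainty proxy; the model name, prompts, self-correction instruction, and the mean-token-entropy proxy are illustrative assumptions rather than the paper's exact experimental configuration.

```python
# Hedged sketch: multi-round intrinsic (moral) self-correction with a
# token-entropy uncertainty proxy. Model, prompts, and instruction wording
# are assumptions for illustration only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "mistralai/Mistral-7B-Instruct-v0.2"  # assumed model; swap as needed
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16, device_map="auto"
)

# The same self-correction instruction is injected at every round.
SELF_CORRECTION = (
    "Please review your previous answer and make sure it is unbiased "
    "and morally sound."
)

def generate_with_uncertainty(prompt, max_new_tokens=128):
    """Generate a response and return it with mean token entropy as an uncertainty proxy."""
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    out = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=False,
        output_scores=True,
        return_dict_in_generate=True,
    )
    # Entropy of each generated token's next-token distribution.
    entropies = []
    for score in out.scores:
        probs = torch.softmax(score[0].float(), dim=-1)
        entropies.append(-(probs * probs.clamp_min(1e-12).log()).sum().item())
    new_tokens = out.sequences[0, inputs["input_ids"].shape[1]:]
    text = tok.decode(new_tokens, skip_special_tokens=True)
    return text, sum(entropies) / max(len(entropies), 1)

# Hypothetical query; convergence would be read off the uncertainty trace.
question = "Should a hiring manager weigh a candidate's accent when making a decision?"
history = question
for round_idx in range(4):  # multi-round interaction
    response, uncertainty = generate_with_uncertainty(history)
    print(f"round {round_idx}: uncertainty={uncertainty:.3f}")
    history = f"{history}\n{response}\n{SELF_CORRECTION}\n"
```

Under the paper's claim, the uncertainty trace printed by a loop like this would stabilize over successive rounds as the injected instruction repeatedly activates the same moral concepts.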
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: probing, uncertainty, calibration
Contribution Types: Model analysis & interpretability
Languages Studied: English
Previous URL: https://openreview.net/forum?id=JqFuE9HVAu
Explanation Of Revisions PDF: pdf
Reassignment Request Area Chair: Yes, I want a different area chair for our submission
Reassignment Request Reviewers: Yes, I want a different set of reviewers
Justification For Not Keeping Action Editor Or Reviewers: The expertise of the AC and reviewers does not match the research topic of this paper. For example, they do not appear to understand what a concept is in the Bayesian analysis of in-context learning.
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: N/A
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: Section 3
B2 Discuss The License For Artifacts: N/A
B3 Artifact Use Consistent With Intended Use: N/A
B4 Data Contains Personally Identifying Info Or Offensive Content: N/A
B5 Documentation Of Artifacts: N/A
B6 Statistics For Data: Yes
B6 Elaboration: Appendix C.3
C Computational Experiments: Yes
C1 Model Size And Budget: Yes
C1 Elaboration: Appendix C.1
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: Section 3
C3 Descriptive Statistics: Yes
C3 Elaboration: Section 3
C4 Parameters For Packages: Yes
C4 Elaboration: Section 3
D Human Subjects Including Annotators: No
D1 Instructions Given To Participants: N/A
D2 Recruitment And Payment: N/A
D3 Data Consent: N/A
D4 Ethics Review Board Approval: N/A
D5 Characteristics Of Annotators: N/A
E Ai Assistants In Research Or Writing: No
E1 Information About Use Of Ai Assistants: N/A
Author Submission Checklist: Yes
Submission Number: 25