Moral Self-correction is Not An Innate Capability in Language Models

ACL ARR 2025 July Submission128 Authors

23 Jul 2025 (modified: 22 Aug 2025) · ACL ARR 2025 July Submission · CC BY 4.0
Abstract: Although there has been growing interest in the self-correction capabilities of Large Language Models (LLMs), conclusions about their effectiveness vary. Prior research has largely concentrated on intrinsic self-correction; extrinsic self-correction, particularly the interplay between internal knowledge and external feedback, remains underexplored. In this paper, we comprehensively investigate the underlying mechanism of moral self-correction by addressing a fundamental question: is moral self-correction an innate capability of LLMs? Specifically, we conduct: (1) a behavioral analysis of LLMs' moral sensitivity based on a self-distinguishing task; and (2) a mechanistic analysis of hidden states to examine how key components of self-correction, such as Chain-of-Thought (CoT) reasoning and external feedback, interact to facilitate moral self-correction. Drawing on empirical evidence from both behavioral and mechanistic analyses, we demonstrate that moral self-correction is not an innate capability of LLMs: they are neither morally sensitive nor able to effectively incorporate external feedback during the self-correction process.
Paper Type: Long
Research Area: Ethics, Bias, and Fairness
Research Area Keywords: bias, toxicity, moral
Contribution Types: Model analysis & interpretability
Languages Studied: English
Previous URL: https://openreview.net/forum?id=TzJRq8wo6D
Explanation Of Revisions PDF: pdf
Reassignment Request Area Chair: Yes, I want a different area chair for our submission
Reassignment Request Reviewers: Yes, I want a different set of reviewers
Justification For Not Keeping Action Editor Or Reviewers: Some reviewers kept criticizing the notation of the methods without raising detailed questions about the content of this paper.
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: N/A
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: Section 2
B2 Discuss The License For Artifacts: N/A
B3 Artifact Use Consistent With Intended Use: N/A
B4 Data Contains Personally Identifying Info Or Offensive Content: N/A
B5 Documentation Of Artifacts: N/A
B6 Statistics For Data: N/A
C Computational Experiments: Yes
C1 Model Size And Budget: Yes
C1 Elaboration: Section 2
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: Section 2
C3 Descriptive Statistics: Yes
C3 Elaboration: Section 4
C4 Parameters For Packages: Yes
C4 Elaboration: Section 2
D Human Subjects Including Annotators: No
D1 Instructions Given To Participants: N/A
D2 Recruitment And Payment: N/A
D3 Data Consent: N/A
D4 Ethics Review Board Approval: N/A
D5 Characteristics Of Annotators: N/A
E AI Assistants In Research Or Writing: No
E1 Information About Use Of AI Assistants: N/A
Author Submission Checklist: Yes
Submission Number: 128