Intent-Aware Self-Correction for Mitigating Social Biases in Large Language Models

ACL ARR 2025 February Submission7780 Authors

16 Feb 2025 (modified: 09 May 2025) · ACL ARR 2025 February Submission · License: CC BY 4.0
Abstract: Self-Correction based on feedback improves the output quality of Large Language Models (LLMs) and can potentially reduce social biases such as those related to gender and race. However, LLMs are sensitive to contextual ambiguities and inconsistencies, which can amplify those biases. When using Self-Correction for debiasing, it is therefore crucial that the intentions of the LLMs are communicated explicitly during their interactions. In this study, we demonstrate that clarifying intentions is essential for effectively reducing biases in LLMs through Self-Correction. We divide the components needed for Self-Correction into three parts, instruction, response, and feedback, and clarify intentions in each. In the instruction, we incorporate an explicit debiasing prompt to convey the intention of bias mitigation when generating responses. In the response, we use Chain-of-Thought (CoT) prompting to make the reasoning process explicit. In the feedback, we define the evaluation aspects necessary for debiasing and provide clear feedback through multi-aspect critiques and scoring. Through experiments, we demonstrate that self-correcting CoT responses obtained from a debiasing prompt based on multi-aspect feedback reduces biased responses more robustly and consistently than the baselines. We also find that debiasing efficacy varies when using models with different bias levels or when separating the models that generate responses and feedback.
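
The loop below is a minimal sketch of the intent-aware Self-Correction pipeline the abstract describes (debiasing instruction, CoT response, multi-aspect feedback, revision). It assumes a generic chat-completion function `llm(messages) -> str`; the prompt wording, the aspect names, and the 1-5 scoring format are illustrative placeholders, not the paper's exact prompts.

```python
# Minimal sketch of an intent-aware self-correction loop. `llm` is any
# chat-completion callable taking a list of {"role", "content"} messages
# and returning a string; all prompt text here is illustrative.

# Explicit debiasing intent in the instruction, plus a CoT cue for the response.
DEBIAS_INSTRUCTION = (
    "Answer the question without relying on stereotypes about gender, race, "
    "or other social groups. Think step by step before answering."
)

# Hypothetical evaluation aspects for multi-aspect critiques and scoring.
ASPECTS = ["stereotype reliance", "evidence grounding", "group fairness"]

def self_correct(llm, question: str, max_rounds: int = 2) -> str:
    # Initial CoT response generated under the explicit debiasing instruction.
    response = llm([
        {"role": "system", "content": DEBIAS_INSTRUCTION},
        {"role": "user", "content": question},
    ])
    for _ in range(max_rounds):
        # Feedback step: critique and score the response on each aspect.
        feedback = llm([
            {"role": "user", "content":
                f"Question: {question}\nResponse: {response}\n"
                "For each aspect below, give a short critique and a 1-5 score:\n"
                + "\n".join(f"- {a}" for a in ASPECTS)},
        ])
        # Revision step: refine the CoT response using the explicit feedback.
        response = llm([
            {"role": "system", "content": DEBIAS_INSTRUCTION},
            {"role": "user", "content":
                f"Question: {question}\nPrevious response: {response}\n"
                f"Feedback:\n{feedback}\n"
                "Revise the response step by step, addressing the feedback."},
        ])
    return response
```

As the abstract notes, the response generator and the feedback model need not be the same LLM; passing two different callables for the generation and feedback calls is the natural way to study that separation.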
Paper Type: Long
Research Area: Ethics, Bias, and Fairness
Research Area Keywords: Social Biases, Bias Mitigation, Self-Correction
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 7780