Causal Multi-Objective Reinforcement Debiasing for Large Language Models

ACL ARR 2025 May Submission297 Authors

10 May 2025 (modified: 03 Jul 2025), ACL ARR 2025 May Submission, CC BY 4.0
Abstract: Large language models (LLMs) often generate outputs that reflect social biases, and existing mitigation techniques tend to degrade task performance. Building on the MOMA framework, we introduce Causal Multi-Objective Reinforcement Debiasing (CMOR), a method that dynamically trades off accuracy and fairness. CMOR formulates bias mitigation as a multi-objective optimization problem in which an agent sequentially transforms the prompt via masked replacements and context insertions to "cut" spurious causal links between sensitive content and outputs. CMOR overcomes MOMA's limitations (semantic loss from rigid masks, reliance on a fixed bias-word list, and the high cost of multiple agents) by learning soft, context-aware interventions and requiring only two model calls per query. Experiments on two benchmark datasets show that CMOR achieves a Pareto-superior trade-off: it reduces bias scores to levels close to MOMA's while preserving higher accuracy. For example, on BBQ it cuts bias by over 80% with less than 2% accuracy loss, outperforming baselines such as CoT, Self-Consistency, and Society-of-Mind. These results demonstrate CMOR's effectiveness in jointly optimizing fairness and utility in LLMs.
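
To make the abstract's mechanism concrete, below is a minimal runnable sketch of the two-call pipeline (one intervention call, one answering call) with a linearly scalarized accuracy/bias reward. The action set, the intervene/answer stand-ins, the toy BBQ-style prompt, and the weight lam are illustrative assumptions for exposition, not the authors' implementation.

# Hypothetical sketch of the CMOR loop described in the abstract.
# The action names, toy reward terms, and scalarization weight `lam`
# are illustrative assumptions, not the paper's code.

import random

ACTIONS = ("mask_replace", "insert_context")

def intervene(prompt: str, action: str) -> str:
    """Call 1 of 2: apply one soft prompt intervention. A real system
    would use an LLM rewriting call; here it is a toy string edit."""
    if action == "mask_replace":
        # Replace a sensitive span with a neutral placeholder.
        return prompt.replace("elderly", "[PERSON]")
    # Prepend neutral context to weaken the spurious causal link.
    return "Consider only the stated evidence. " + prompt

def answer(prompt: str) -> str:
    """Call 2 of 2: stand-in for the task-answering LLM call."""
    debiased = "[PERSON]" in prompt or "evidence" in prompt
    return "unknown" if debiased else "biased"

def reward(pred: str, gold: str, lam: float = 0.5) -> float:
    """Scalarized multi-objective reward: accuracy minus lam * bias."""
    acc = 1.0 if pred == gold else 0.0
    bias = 1.0 if pred == "biased" else 0.0
    return acc - lam * bias

# One step of a trivial random policy over the two intervention actions.
prompt = "The elderly applicant and the young applicant interviewed; who struggled?"
action = random.choice(ACTIONS)
pred = answer(intervene(prompt, action))
print(action, pred, reward(pred, "unknown"))

In a trained agent, the random action choice above would be replaced by a policy optimized against this reward, so that the intervention adapts to context rather than relying on a fixed bias-word list.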
Paper Type: Short
Research Area: Ethics, Bias, and Fairness
Research Area Keywords: model bias/fairness evaluation, reinforcement learning, causality, prompting, safety and alignment
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Submission Number: 297