Causal Proximal Policy Optimization

ICLR 2026 Conference Submission 22690 Authors

20 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Reinforcement Learning from Human Feedback, Causal Inference, Back-Door Adjustment, Reward Modeling, Policy Optimization
TL;DR: We propose CPPO, a causal RLHF method that applies back-door–adjusted rewards during PPO to reduce demographic bias.
Abstract: In this paper, we address bias mitigation in Reinforcement Learning from Human Feedback (RLHF) through the lens of causal inference. Existing approaches typically focus on prompt engineering or isolated reward modeling, and they often fail to address prompt-level confounding that affects both model responses and reward signals. We introduce Causal Proximal Policy Optimization (CPPO), a unified framework that models prompt-based confounders and integrates them into both reward learning and policy training. By predicting confounders from the prompt and applying back-door adjustment, CPPO removes spurious correlations on the causal path from responses to rewards. This approach eliminates the need for mediators or adversarial optimization and enables confounder-aware policy updates. We demonstrate that CPPO improves robustness to demographic and representational biases on the DiscrimEval benchmark, outperforming existing methods.
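For intuition, a minimal sketch of a back-door adjusted reward consistent with the abstract, where x denotes the prompt, y the response, z the confounder inferred from the prompt, and r the learned reward model (this notation is illustrative and not taken from the paper):

\[
  \hat{r}(x, y) \;=\; \mathbb{E}_{z \sim p(z \mid x)}\bigl[\, r(x, y, z) \,\bigr] \;\approx\; \sum_{z} p(z \mid x)\, r(x, y, z)
\]

Averaging over the inferred confounder distribution, rather than conditioning on a single realized value, blocks the back-door path through z that influences both the response and the reward; the adjusted reward \(\hat{r}(x, y)\) would then stand in for the raw reward signal during the PPO update.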
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 22690