Causally-Enhanced Reinforcement Policy Optimization of Large Language Models

03 Sept 2025 (modified: 12 Feb 2026) · ICLR 2026 Conference Desk Rejected Submission · CC BY 4.0
Keywords: Large Language Models, Reinforcement Learning, Causal Inference
TL;DR: We propose Causally-Enhanced Policy Optimization (CE-PO), a reinforcement learning framework for large language models that integrates causal inference signals with standard rewards to improve reasoning coherence and mitigate reward hacking.
Abstract: Large language models (LLMs) trained with reinforcement objectives often achieve superficially correct answers via shortcut strategies, pairing correct outputs with spurious or unfaithful reasoning and degrading under small causal perturbations. We introduce Causally-Enhanced Policy Optimization (CE-PO), a drop-in reward-shaping framework that augments policy optimization with a differentiable proxy for causal coherence along the generation pathway from prompt ($Z$) to rationale ($X$) to answer ($Y$). CE-PO estimates model-internal influence with Jacobian-based sensitivities, counterfactually hardens these signals to suppress nuisance cues, and fuses the resulting coherence score with task-accuracy feedback via a Minkowski (power-mean) combiner, exposing a single tunable knob for the accuracy–coherence trade-off. The unified reward integrates with PPO/GRPO without architectural changes. Across reasoning benchmarks and causal stress tests, CE-PO reduces reward hacking and unfaithful chain-of-thought while improving robustness to correlation–causation flips and light counterfactual edits, all at near-parity accuracy. Across four datasets, CE-PO improves accuracy over baselines by 5.49% on average (up to 9.58%).
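A minimal sketch of the Minkowski (power-mean) combiner described above, assuming a task-accuracy reward $r_{\text{acc}}$, a causal-coherence score $r_{\text{coh}}$, a mixing weight $\lambda \in [0,1]$, and an exponent $p$ (these symbol names are illustrative, not taken from the paper):
$$ R_{\text{CE-PO}} \;=\; \bigl(\lambda\, r_{\text{acc}}^{\,p} \;+\; (1-\lambda)\, r_{\text{coh}}^{\,p}\bigr)^{1/p} $$
Under this reading, $p$ (together with $\lambda$) is the single tunable mentioned in the abstract: $p = 1$ reduces to a weighted average of accuracy and coherence, while $p \to -\infty$ approaches $\min(r_{\text{acc}}, r_{\text{coh}})$, penalizing answers that are correct but incoherently reasoned.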
Primary Area: reinforcement learning
Submission Number: 1776