Keywords: Offline Reinforcement Learning
Abstract: Offline reinforcement learning (RL) methods suffer from \textit{extrapolation error}. Existing solutions face a dilemma: static constraints are over-conservative, while naive dynamic references couple policy improvement with conservatism, which can drive the reference itself toward an out-of-distribution (OOD) target. Our core insight is that these two objectives must be decoupled. In this paper, we introduce the \textbf{C}overage-\textbf{A}ware \textbf{R}eference Policy (CAR), which instantiates this principle via a propose-and-verify mechanism: the learned policy proposes actions, and a verifier confirms data support before they are used to augment the reference policy. This yields a tractable, progressively improving reference with theoretical coverage guarantees. CAR achieves state-of-the-art (SOTA) performance on offline RL benchmarks and strong results in online fine-tuning.
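To make the propose-and-verify mechanism concrete, here is a minimal Python sketch under the assumption that data support is checked with a nearest-neighbor coverage test over the offline dataset; all names (`CoverageVerifier`, the distance threshold, the reference buffer) are illustrative and not the authors' implementation.

```python
import numpy as np

class CoverageVerifier:
    """Accepts a proposed action only if the dataset contains a nearby
    (state, action) pair, i.e. the action is supported by the data.
    (Assumed nearest-neighbor instantiation of the verifier.)"""
    def __init__(self, dataset_states, dataset_actions, threshold=0.1):
        self.states = dataset_states      # (N, state_dim)
        self.actions = dataset_actions    # (N, action_dim)
        self.threshold = threshold        # assumed support radius

    def is_supported(self, state, action, k=10):
        # Find the k dataset transitions whose states are closest to `state`.
        dists = np.linalg.norm(self.states - state, axis=1)
        nearest = np.argsort(dists)[:k]
        # The proposal counts as covered if any of their actions is close to it.
        action_gap = np.linalg.norm(self.actions[nearest] - action, axis=1)
        return bool(action_gap.min() <= self.threshold)

def update_reference(policy, verifier, reference_buffer, states):
    """Propose-and-verify loop: the learned policy proposes actions, the
    verifier checks data support, and only verified proposals augment the
    buffer the reference policy is fit to, keeping it in-distribution."""
    for s in states:
        a = policy(s)                         # proposal from the learned policy
        if verifier.is_supported(s, a):
            reference_buffer.append((s, a))   # verified: improves the reference
        # rejected proposals are discarded; the reference never sees OOD targets
    return reference_buffer
```

This is only a sketch of the decoupling idea in the abstract: policy improvement happens in the proposer, while conservatism is enforced solely by the verifier before the reference is updated.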
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 10692