Keywords: Offline Reinforcement Learning
Abstract: Offline reinforcement learning (RL) methods suffer from \textit{extrapolation error}. Existing solutions face a dilemma: static constraints are over-conservative, while naive dynamic references couple policy improvement with conservatism, which can drive the reference itself toward an out-of-distribution (OOD) target. Our core insight is that these two objectives must be decoupled. In this paper, we introduce the \textbf{C}overage-\textbf{A}ware \textbf{R}eference Policy (CAR), which instantiates this principle via a propose-and-verify mechanism: the learned policy proposes actions, and a verifier confirms data support before they are used to augment the reference policy. This yields a tractable, progressively improving reference with theoretical coverage guarantees. CAR achieves state-of-the-art (SOTA) performance on offline RL benchmarks and strong results in online fine-tuning.
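To make the propose-and-verify mechanism concrete, here is a minimal Python sketch under the assumption that data support is checked with a nearest-neighbor coverage test over the offline dataset; all names (`CoverageVerifier`, the distance threshold, the reference buffer) are illustrative and not the authors' implementation.

```python
import numpy as np

class CoverageVerifier:
    """Accepts a proposed action only if the dataset contains a nearby
    (state, action) pair, i.e. the action is supported by the data.
    (Assumed nearest-neighbor instantiation of the verifier.)"""
    def __init__(self, dataset_states, dataset_actions, threshold=0.1):
        self.states = dataset_states      # (N, state_dim)
        self.actions = dataset_actions    # (N, action_dim)
        self.threshold = threshold        # assumed support radius

    def is_supported(self, state, action, k=10):
        # Find the k dataset transitions whose states are closest to `state`.
        dists = np.linalg.norm(self.states - state, axis=1)
        nearest = np.argsort(dists)[:k]
        # The proposal counts as covered if any of their actions is close to it.
        action_gap = np.linalg.norm(self.actions[nearest] - action, axis=1)
        return bool(action_gap.min() <= self.threshold)

def update_reference(policy, verifier, reference_buffer, states):
    """Propose-and-verify loop: the learned policy proposes actions, the
    verifier checks data support, and only verified proposals augment the
    buffer the reference policy is fit to, keeping it in-distribution."""
    for s in states:
        a = policy(s)                         # proposal from the learned policy
        if verifier.is_supported(s, a):
            reference_buffer.append((s, a))   # verified: improves the reference
        # rejected proposals are discarded; the reference never sees OOD targets
    return reference_buffer
```

This is only a sketch of the decoupling idea in the abstract: policy improvement happens in the proposer, while conservatism is enforced solely by the verifier before the reference is updated.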
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 10692