Peng's Q($\lambda$) for Conservative Value Estimation in Offline Reinforcement Learning

ICLR 2026 Conference Submission 18686 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Offline reinforcement learning, Offline-to-online settings, Multi-step operator
TL;DR: This paper introduces Conservative Peng's Q($\lambda$) (CPQL), which mitigates overly pessimistic value estimation, achieves performance greater than or equal to that of the behavior policy, and provides near-optimal performance guarantees.
Abstract: We propose a model-free offline multi-step reinforcement learning (RL) algorithm, Conservative Peng's Q($\lambda$) (CPQL). Our algorithm adapts Peng's Q($\lambda$) (PQL) operator for conservative value estimation as an alternative to the Bellman operator. To the best of our knowledge, this is the first work in offline RL to demonstrate, both theoretically and empirically, the effectiveness of conservative value estimation with a *multi-step* operator that fully leverages offline trajectories. The fixed point of the PQL operator in offline RL lies closer to the value function of the behavior policy, naturally inducing implicit behavior regularization. CPQL simultaneously mitigates overly pessimistic value estimation, achieves performance greater than or equal to that of the behavior policy, and provides near-optimal performance guarantees --- a milestone that previous conservative approaches could not reach. Extensive numerical experiments on the D4RL benchmark demonstrate that CPQL consistently and significantly outperforms existing offline single-step baselines. Beyond its contributions to offline RL, CPQL also benefits the offline-to-online learning framework: initializing the online PQL agent with a Q-function pre-trained offline by CPQL avoids the performance drop typically observed at the start of fine-tuning and yields robust performance improvement.
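For reference, the multi-step backup underlying PQL is the standard Peng's Q($\lambda$) return from the literature: a $\lambda$-weighted mixture of $n$-step returns, each bootstrapping with the greedy value. The sketch below uses the usual symbols $r_{t+k}$ (reward), $\gamma$ (discount), and $Q$ (value estimate); the conservative modification that defines CPQL is introduced in the paper and is not reproduced here.

$$ G_t^{\lambda} \;=\; (1-\lambda) \sum_{n=1}^{\infty} \lambda^{\,n-1} \Big( \sum_{k=0}^{n-1} \gamma^{k}\, r_{t+k} \;+\; \gamma^{n} \max_{a} Q(s_{t+n}, a) \Big). $$

Setting $\lambda = 0$ recovers the one-step Bellman backup, while $\lambda \to 1$ weights long multi-step returns along the offline trajectory more heavily.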
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 18686