Offline Reinforcement Learning via Action-Space Pseudo-Labeling

Yunfan Zhou; Xijun Li; Jianguo Yao; Haibing Guan

18 Sept 2025 (modified: 11 Feb 2026), Submitted to ICLR 2026, CC BY 4.0

Keywords: offline reinforcement learning, pseudo-labeling

TL;DR: We propose Action-Space Pseudo-Labeling (ASPL), a simple method that regularizes offline RL with pseudo-labels for unseen actions, yielding stable and consistent gains without fragile tuning.

Abstract: The critical challenge of offline reinforcement learning (offline RL) is improving a policy from a fixed dataset while avoiding value overestimation on out-of-distribution (OOD) actions. Existing methods typically regularize the learned policy so that it avoids overestimated OOD actions. However, we argue that this often over-constrains policy improvement or requires sensitive hyperparameter tuning. We restate the challenge as the absence of explicit training signals for the value function in parts of the state–action space. A more effective approach is to provide explicit training signals across the entire action space to eliminate overestimation. We introduce a surprisingly simple yet effective method, $\textbf{Action-Space Pseudo-Labeling (ASPL)}$, to resolve this challenge. It fills in the value function's missing signals by assigning pseudo Q-targets that decrease with distance from the behavior support (i.e., the support of the behavior policy). In practice, ASPL achieves an implicit behavior-aware regularization that strengthens as behavior likelihood decreases. On D4RL datasets, we observe stable training and consistent improvements over strong offline baselines with a minor tuning burden. Code for reproducing the experiments is provided in the supplementary material.
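To make the idea of a pseudo Q-target that decreases with distance from the behavior support concrete, here is a minimal sketch. It is not the authors' implementation: the abstract does not specify the distance measure, the shape of the decay, or the anchor value. The sketch assumes, purely for illustration, Euclidean distance to the nearest dataset action for a given state, a linear decay with a hypothetical `slope` parameter, and the nearest dataset action's Q-target as the anchor. All function and parameter names are hypothetical.

```python
import math


def nearest_support_action(action, support_actions):
    """Return (distance, index) of the dataset action closest to `action`.

    Assumption: the behavior support at a state is approximated by the
    actions recorded in the dataset for that state.
    """
    if not support_actions:
        raise ValueError("support_actions must not be empty")
    best_dist, best_idx = math.inf, -1
    for idx, candidate in enumerate(support_actions):
        dist = math.dist(action, candidate)  # Euclidean distance
        if dist < best_dist:
            best_dist, best_idx = dist, idx
    return best_dist, best_idx


def pseudo_q_target(action, support_actions, support_q_targets, slope=1.0):
    """Illustrative pseudo-label for an arbitrary action.

    Assumption: the label equals the Q-target of the nearest in-dataset
    action minus a linear penalty `slope * distance`, so it coincides with
    the real target on the support and decreases away from it.
    """
    dist, idx = nearest_support_action(action, support_actions)
    return support_q_targets[idx] - slope * dist


# Toy example: two dataset actions for one state, with their Q-targets.
support_actions = [(0.0, 0.0), (1.0, 0.0)]
support_q_targets = [5.0, 3.0]

on_support = pseudo_q_target((0.0, 0.0), support_actions, support_q_targets)
near = pseudo_q_target((0.0, 2.0), support_actions, support_q_targets)
far = pseudo_q_target((0.0, 4.0), support_actions, support_q_targets)
```

In this toy setup the label on a dataset action equals its original target, and labels shrink as the queried action moves away from the data, which is the qualitative behavior the abstract describes. How the paper actually defines the distance and the decay may differ.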

Supplementary Material: zip

Primary Area: reinforcement learning

Submission Number: 10354
