Offline Reinforcement Learning via Action-Space Pseudo-Labeling

18 Sept 2025 (modified: 11 Feb 2026), Submitted to ICLR 2026, CC BY 4.0
Keywords: offline reinforcement learning, pseudo-labeling
TL;DR: We propose Action-Space Pseudo-Labeling (ASPL), a simple method that regularizes offline RL with pseudo-labels for unseen actions, yielding stable and consistent gains without fragile tuning.
Abstract: The critical challenge of offline reinforcement learning (offline RL) is improving from a fixed dataset while avoiding overestimation on out-of-distribution (OOD) actions. Existing methods typically regularize the learned policy so that it avoids choosing overestimated OOD actions. However, we argue that this often over-constrains policy improvement or requires sensitive hyperparameter tuning. We restate the challenge as the absence of explicit training signals for the value function in parts of the state–action space. A more effective approach is to provide explicit training signals across the entire action space to eliminate overestimation. To this end, we introduce a surprisingly simple yet effective method, $\textbf{Action-Space Pseudo-Labeling (ASPL)}$. It fills in the value function's missing signals by assigning pseudo Q-targets that decrease with distance from the behavior support (i.e., the support of the behavior policy). In practice, ASPL achieves an implicit behavior-aware regularization that strengthens as behavior likelihood decreases. On D4RL datasets, we observe stable training and consistent improvements over strong offline baselines with little tuning burden. Code for reproducing the experiments is provided in the supplementary material.
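Illustrative sketch (not from the paper): the abstract states only that pseudo Q-targets decrease with distance from the behavior support. The toy Python below shows one way such a label could be formed, assuming a Euclidean nearest-neighbor distance to the dataset actions and a linear penalty. The function names, `base_value`, and `slope` are hypothetical and are not taken from the authors' code.

```python
import math


def nearest_distance(action, dataset_actions):
    """Euclidean distance from `action` to the closest action in the dataset.

    Assumption: distance to the nearest logged action stands in for
    "distance from the behavior support".
    """
    return min(math.dist(action, a) for a in dataset_actions)


def pseudo_q_target(action, dataset_actions, base_value, slope=1.0):
    """Hypothetical pseudo-label: `base_value` minus a penalty linear in distance.

    The linear form and the `slope` parameter are illustrative choices only;
    the abstract does not specify how the target decreases.
    """
    return base_value - slope * nearest_distance(action, dataset_actions)


# Toy 2-D action set logged for a single state.
dataset_actions = [(0.0, 0.0), (1.0, 0.0)]

# An action that appears in the dataset receives no penalty.
in_support = pseudo_q_target((1.0, 0.0), dataset_actions, base_value=5.0)

# An action far from every logged action receives a lower pseudo-target.
far_away = pseudo_q_target((1.0, 2.0), dataset_actions, base_value=5.0)
```

Under these assumptions the pseudo-target for an unseen action is never higher than the value assigned at the nearest logged action, which is the qualitative behavior the abstract describes.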
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 10354