Keywords: offline reinforcement learning, pseudo-labeling
TL;DR: We propose Action-Space Pseudo-Labeling (ASPL), a simple method that regularizes offline RL with pseudo-labels for unseen actions, yielding stable and consistent gains without fragile tuning.
Abstract: The critical challenge of offline reinforcement learning (offline RL) is improving
from a fixed dataset while avoiding overestimation on out-of-distribution (OOD)
actions. Existing methods typically regularize the learned policy to avoid choosing overestimated OOD actions. However, we argue that this often over-constrains policy improvement or requires sensitive hyperparameter tuning. We restate this challenge as the absence of explicit training signals for the
value function in parts of the state–action space. A more effective approach is to provide explicit training signals across the entire action space to eliminate overestimation. To resolve this challenge, we introduce a surprisingly simple yet effective method, $\textbf{Action-Space Pseudo-Labeling (ASPL)}$. It fills in the value function's
missing signals by assigning pseudo Q-targets that decrease with distance from
the behavior support (i.e., the support of the behavior policy). In practice, ASPL achieves an implicit behavior-aware regularization that strengthens as behavior likelihood decreases. On D4RL datasets,
we observe stable training and consistent improvements over strong offline baselines with little tuning burden. Code for reproducing the experiments is provided in the supplementary material.
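As a rough illustration of the pseudo-labeling idea described in the abstract (not the authors' implementation, which is in the supplementary material), the sketch below assigns each sampled action a pseudo Q-target equal to the dataset action's Q-target minus a penalty that grows with its distance from that dataset action. The function name `pseudo_q_targets`, the `penalty` coefficient, the linear decay, and the use of Euclidean distance to the single logged action as a proxy for distance from the behavior support are all assumptions made for illustration.

```python
import numpy as np

def pseudo_q_targets(q_data, a_data, a_sampled, penalty=1.0):
    """Hypothetical pseudo-labels: dataset Q-target minus a distance penalty.

    q_data:    (B,)      Q-targets for the logged dataset actions
    a_data:    (B, D)    logged dataset actions
    a_sampled: (B, K, D) actions sampled across the action space
    Returns an array of shape (B, K) with one pseudo Q-target per sampled action.
    """
    # Euclidean distance from each sampled action to the logged action
    # (an assumed stand-in for distance from the behavior support).
    dist = np.linalg.norm(a_sampled - a_data[:, None, :], axis=-1)  # (B, K)
    # Targets decrease linearly with distance (assumed decay shape).
    return q_data[:, None] - penalty * dist

# Toy usage: one state, 1-D actions at increasing distance from the logged action.
q_data = np.array([1.0])
a_data = np.array([[0.0]])
a_sampled = np.array([[[0.0], [0.5], [1.0]]])
targets = pseudo_q_targets(q_data, a_data, a_sampled, penalty=1.0)
```

In this toy case the sampled action that coincides with the logged action keeps the original target, and targets shrink as sampled actions move away, which mirrors the behavior-aware regularization the abstract describes.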
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 10354