Keywords: Q-learning, Offline RL
Abstract: Offline reinforcement learning (RL) seeks to optimize policies from fixed datasets, enabling deployment in domains where environment interaction is costly or unsafe. A central challenge in this setting is the overestimation of out-of-distribution (OOD) actions, which arises when Q-networks assign high values to actions absent from the dataset. To address this, we propose Penalized Action Noise Injection (PANI), a lightweight Q-learning approach that perturbs dataset actions with controlled noise to increase action-space coverage while introducing a penalty proportional to the noise magnitude to mitigate overestimation. We theoretically show that PANI is equivalent to Q-learning on a Noisy Action Markov Decision Process (NAMDP), providing a principled foundation for its design. Importantly, PANI is algorithm-agnostic and requires only minor modifications to existing off-policy and offline RL methods, making it broadly applicable in practice. Despite its simplicity, PANI achieves substantial performance improvements across various offline RL benchmarks, demonstrating both effectiveness and practicality as a drop-in enhancement.
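The abstract's description suggests a simple penalized TD update on noise-perturbed dataset actions. Below is a minimal, hypothetical sketch of what such an update might look like, assuming Gaussian noise, a normalized continuous action space, a PyTorch Q-network `q_net`, a bootstrapped value estimator `v_target`, and hyperparameters `sigma` (noise scale) and `alpha` (penalty weight); these names and the exact form of the penalty are assumptions, not the paper's stated implementation.

```python
import torch
import torch.nn.functional as F

def pani_loss(q_net, v_target, batch, sigma=0.1, alpha=1.0, gamma=0.99):
    """Sketch of a PANI-style penalized noisy-action Q-learning loss."""
    s, a, r, s_next, done = batch  # tensors sampled from the offline dataset

    # Perturb dataset actions with controlled noise to widen coverage
    # of the action space around the behavior policy's support.
    noise = sigma * torch.randn_like(a)
    a_noisy = (a + noise).clamp(-1.0, 1.0)  # assumes actions normalized to [-1, 1]

    # Penalize the bootstrapped target in proportion to the noise magnitude,
    # discouraging high value estimates for actions far from the data.
    penalty = alpha * noise.norm(dim=-1, keepdim=True)

    with torch.no_grad():
        target = r + gamma * (1.0 - done) * v_target(s_next) - penalty

    return F.mse_loss(q_net(s, a_noisy), target)
```

Because the modification touches only the Q-target computation, it could in principle be dropped into existing off-policy or offline RL training loops, consistent with the abstract's claim that PANI is algorithm-agnostic.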
Primary Area: reinforcement learning
Submission Number: 18080