Keywords: Q-learning, Offline RL
Abstract: Offline reinforcement learning (RL) seeks to optimize policies from fixed datasets, enabling deployment in domains where environment interaction is costly or unsafe. A central challenge in this setting is the overestimation of out-of-distribution (OOD) actions, which arises when Q-networks assign high values to actions absent from the dataset. To address this, we propose Penalized Action Noise Injection (PANI), a lightweight Q-learning approach that perturbs dataset actions with controlled noise to increase action-space coverage while introducing a penalty proportional to the noise magnitude to mitigate overestimation. We theoretically show that PANI is equivalent to Q-learning on a Noisy Action Markov Decision Process (NAMDP), providing a principled foundation for its design. Importantly, PANI is algorithm-agnostic and requires only minor modifications to existing off-policy and offline RL methods, making it broadly applicable in practice. Despite its simplicity, PANI achieves substantial performance improvements across various offline RL benchmarks, demonstrating both effectiveness and practicality as a drop-in enhancement.
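The abstract's description suggests a simple penalized TD update on noise-perturbed dataset actions. Below is a minimal, hypothetical sketch of what such an update might look like, assuming Gaussian noise, a normalized continuous action space, a PyTorch Q-network `q_net`, a bootstrapped value estimator `v_target`, and hyperparameters `sigma` (noise scale) and `alpha` (penalty weight); these names and the exact form of the penalty are assumptions, not the paper's stated implementation.

```python
import torch
import torch.nn.functional as F

def pani_loss(q_net, v_target, batch, sigma=0.1, alpha=1.0, gamma=0.99):
    """Sketch of a PANI-style penalized noisy-action Q-learning loss."""
    s, a, r, s_next, done = batch  # tensors sampled from the offline dataset

    # Perturb dataset actions with controlled noise to widen coverage
    # of the action space around the behavior policy's support.
    noise = sigma * torch.randn_like(a)
    a_noisy = (a + noise).clamp(-1.0, 1.0)  # assumes actions normalized to [-1, 1]

    # Penalize the bootstrapped target in proportion to the noise magnitude,
    # discouraging high value estimates for actions far from the data.
    penalty = alpha * noise.norm(dim=-1, keepdim=True)

    with torch.no_grad():
        target = r + gamma * (1.0 - done) * v_target(s_next) - penalty

    return F.mse_loss(q_net(s, a_noisy), target)
```

Because the modification touches only the Q-target computation, it could in principle be dropped into existing off-policy or offline RL training loops, consistent with the abstract's claim that PANI is algorithm-agnostic.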
Primary Area: reinforcement learning
Submission Number: 18080