TL;DR: We propose PARS, which combines reward scaling with layer normalization (RS-LN) and a penalty on infeasible actions (PA), achieving SOTA performance in both offline and offline-to-online RL. It is the only algorithm to successfully learn AntMaze Ultra in both phases.
Abstract: Reinforcement learning with offline data suffers from Q-value extrapolation errors. To address this issue, we first demonstrate that linear extrapolation of the Q-function beyond the data range is particularly problematic. To mitigate this, we propose guiding the gradual decrease of Q-values outside the data range, which is achieved through reward scaling with layer normalization (RS-LN) and a penalization mechanism for infeasible actions (PA). By combining RS-LN and PA, we develop a new algorithm called PARS. We evaluate PARS across a range of tasks, demonstrating superior performance compared to state-of-the-art algorithms in both offline training and online fine-tuning on the D4RL benchmark, with notable success in the challenging AntMaze Ultra task.
Lay Summary: How can we prevent AI from overestimating the value of actions it has never seen in its data?
AI learns the value of actions from the data it has observed, but this understanding is limited to the training distribution. When it encounters situations just beyond that range, especially if the data ends on an upward trend, it may mistakenly assume that unseen actions will have even higher value. In reality, such areas are often unpredictable, and overestimating their value can lead to poor or unsafe decisions.
This paper aims to address the problem of incorrect extrapolation by encouraging the AI to naturally lower its value estimates for actions outside the data boundary. This helps prevent overly optimistic predictions. To this end, it proposes two key techniques:
- **Reward scaling with layer normalization**: By increasing the scale of rewards and applying layer normalization inside the value network, the AI more clearly distinguishes in-distribution actions from those outside. This reduces the influence of training signals on out-of-distribution actions, helping to keep predictions in those areas lower.
- **Penalizing infeasible actions**: Actions that lie outside the feasible region or are clearly unrealistic are explicitly assigned low values, so that the values of actions beyond the data range are naturally pulled downward.
These two strategies are combined in a simple yet effective algorithm called PARS.
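As a concrete illustration, the sketch below shows one way the two components could be wired into a standard TD-style critic update in PyTorch. This is a minimal sketch, not the authors' reference implementation: the network sizes, reward scale, discount factor, infeasible-action sampling range, and penalty target are all illustrative assumptions.

```python
# Minimal sketch (illustrative, not the paper's reference code) of
# reward scaling + a layer-normalized critic (RS-LN) and a penalty on
# infeasible actions (PA). Hyperparameters below are assumptions.
import torch
import torch.nn as nn


class LNCritic(nn.Module):
    """Q-network with layer normalization after each hidden layer."""

    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.LayerNorm(hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.LayerNorm(hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))


def critic_loss(critic, target_q, batch, reward_scale=100.0, penalty_coef=1.0):
    s, a, r, s2, done = batch  # transitions from the offline dataset

    # Reward scaling (RS): enlarge the reward magnitude before forming the TD target.
    scaled_r = reward_scale * r
    td_target = scaled_r + 0.99 * (1.0 - done) * target_q  # target_q from target networks
    td_loss = ((critic(s, a) - td_target) ** 2).mean()

    # Penalizing infeasible actions (PA): sample actions outside the feasible
    # box (assumed here to be [-1, 1] per dimension) and push their Q-values
    # toward a low anchor, so Q decreases rather than extrapolating upward
    # beyond the data range.
    infeasible_a = torch.empty_like(a).uniform_(1.5, 3.0) * torch.sign(torch.randn_like(a))
    low_anchor = td_target.min().detach()  # assumed low target, e.g. the batch minimum
    pa_loss = ((critic(s, infeasible_a) - low_anchor) ** 2).mean()

    return td_loss + penalty_coef * pa_loss
```

The intuition matches the summary above: the penalized infeasible actions give the layer-normalized critic explicit low-value anchors beyond the feasible action box, so its estimates tend to decrease, rather than extrapolate upward, outside the data range.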
PARS outperforms existing methods across a wide range of tasks and helps the AI make safer, more reliable decisions in complex environments. It also performs well when the AI continues learning from new data during online fine-tuning, adapting smoothly without losing the stability gained from offline training.
Primary Area: Reinforcement Learning
Keywords: Offline-to-Online Reinforcement Learning, Offline Reinforcement Learning, Penalizing Infeasible Actions, Layer Normalization, Reward Scaling
Submission Number: 10373