Abstract: Offline reinforcement learning faces the challenge of distributional shift, a consequence of learning a policy from static datasets. Current methods primarily handle this issue by aligning the learned policy with the behavior policy or by conservatively estimating Q-values for out-of-distribution (OOD) actions. However, these approaches can produce overly pessimistic Q-value estimates for OOD actions in unfamiliar situations, resulting in a suboptimal policy. To address this, we propose a new method, Dynamic Uncertainty estimation for Offline Reinforcement Learning. This method introduces a base density-truncated OOD data sampling approach that reduces the impact of extrapolation errors on uncertainty estimation, enabling conservative estimation of Q-values for OOD actions while avoiding negative effects on in-distribution data. We also develop a dynamic uncertainty estimation mechanism that prevents excessive pessimism and enhances the generalization of the Q-function; it dynamically adjusts the degree of pessimism in the Q-function by minimizing the error between target and estimated values. Experimental results on the D4RL benchmark demonstrate that our method outperforms existing algorithms and effectively addresses the distributional shift challenge.
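To make the two ideas in the abstract concrete, the following is a minimal, illustrative sketch (not the paper's implementation): it interprets density-truncated OOD sampling as keeping only candidate actions whose estimated behavior density falls below a cutoff, and it adjusts a scalar pessimism coefficient by descending the error between target values and penalized estimates. All names (`beta`, `density_cutoff`, `behavior_density`, the toy Q-values) are hypothetical placeholders introduced here for illustration.

```python
import numpy as np

# Illustrative sketch only; the paper's actual estimators and update rules may differ.
rng = np.random.default_rng(0)

def behavior_density(actions, dataset_actions, bandwidth=0.2):
    """Crude kernel density estimate of the behavior policy over actions (assumption)."""
    diffs = actions[:, None, :] - dataset_actions[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=-1)
    return np.mean(np.exp(-sq_dists / (2.0 * bandwidth ** 2)), axis=1)

def sample_ood_actions(dataset_actions, n, action_dim, density_cutoff=0.05):
    """Keep only candidate actions whose estimated behavior density is low."""
    candidates = rng.uniform(-1.0, 1.0, size=(4 * n, action_dim))
    dens = behavior_density(candidates, dataset_actions)
    ood = candidates[dens < density_cutoff]
    return ood[:n]  # may return fewer than n if few candidates are low-density

# Toy Q-values standing in for learned estimates and bootstrapped targets.
q_estimate = rng.normal(size=32)
q_target = q_estimate + rng.normal(scale=0.5, size=32)

beta, lr = 1.0, 0.1  # beta: pessimism coefficient (hypothetical)
for _ in range(100):
    # Penalize the estimate by beta times an uncertainty proxy (here, the absolute TD gap).
    uncertainty = np.abs(q_target - q_estimate)
    penalized = q_estimate - beta * uncertainty
    # Gradient step on beta to shrink the squared error between target and
    # penalized estimate, so pessimism decays when it becomes excessive.
    grad_beta = np.mean(2.0 * (penalized - q_target) * (-uncertainty))
    beta = max(0.0, beta - lr * grad_beta)

dataset_actions = rng.uniform(-0.3, 0.3, size=(500, 2))
ood = sample_ood_actions(dataset_actions, n=16, action_dim=2)
print(f"adapted beta: {beta:.3f}, OOD samples kept: {len(ood)}")
```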