- Abstract: The Boltzmann softmax operator trades off exploration and exploitation according to the current value estimates through an exponential weighting scheme, which makes it a promising way to address the exploration-exploitation dilemma in reinforcement learning. Unfortunately, the Boltzmann softmax operator is not a non-expansion, which may lead to unstable or even divergent learning behavior when it is used to estimate the value function. Non-expansion is a vital and widely used sufficient condition for guaranteeing the convergence of value iteration. However, how to characterize the effect of operators that violate the non-expansion property in value iteration remains an open problem. In this paper, we propose a new technique to analyze the error bound of value iteration with the Boltzmann softmax operator. We then propose the dynamic Boltzmann softmax (DBS) operator, which enables convergence to the optimal value function in value iteration. We also present a convergence rate analysis of the algorithm. Using Q-learning as an application, we show that the DBS operator can be applied in a model-free reinforcement learning algorithm. Finally, we demonstrate the effectiveness of the DBS operator in a toy problem called GridWorld and on a suite of Atari games. Experimental results show that the DBS-based approach outperforms DQN substantially in benchmark games.
- Keywords: Reinforcement Learning, Boltzmann Softmax Operator, Value Function Estimation
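
To make the abstract's two central claims concrete, the following is a minimal sketch of the standard Boltzmann softmax operator, which weights each value by the exponential of `beta` times that value. It is not the paper's code: the function name `boltzmann_softmax`, the temperature parameter `beta`, and the example vectors are assumptions chosen for illustration. The sketch shows that the operator moves from the mean (`beta = 0`) toward the max (large `beta`), and gives one numerical instance in which it fails to be a non-expansion under the max norm.

```python
import numpy as np


def boltzmann_softmax(values, beta):
    """Exponentially weighted average of `values` with inverse temperature `beta`.

    Illustrative helper (name and signature are assumptions, not from the paper).
    """
    values = np.asarray(values, dtype=float)
    # Shift by the max before exponentiating; this leaves the result unchanged
    # and avoids overflow for large beta.
    weights = np.exp(beta * (values - values.max()))
    return float(np.dot(weights, values) / weights.sum())


# beta = 0 gives the plain mean; a large beta approaches the max.
q = [1.0, 2.0, 3.0]
mean_like = boltzmann_softmax(q, beta=0.0)
max_like = boltzmann_softmax(q, beta=50.0)

# One instance where the operator is not a non-expansion:
# the outputs differ by more than the max-norm distance between the inputs.
x = np.array([0.0, 1.0])
y = np.array([-0.1, 1.1])
input_distance = float(np.max(np.abs(x - y)))
output_gap = abs(boltzmann_softmax(x, beta=5.0) - boltzmann_softmax(y, beta=5.0))

print(mean_like, max_like)
print(input_distance, output_gap)
```

A dynamic variant in the spirit of the DBS operator would pass a `beta` that grows with the iteration count instead of a constant; the specific schedule is not stated in the abstract, so none is assumed here.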