Keywords: reinforcement learning, Bellman value target, lower bound, discounted return
Abstract: We show that any lower bound on the optimal value function can be used to improve the Bellman value target during value learning. In the tabular case, value learning under the lower-bounded Bellman operator converges to the same optimal value as under the original Bellman operator, potentially faster. In practice, the discounted episodic return from training experience, or the discounted goal return from hindsight relabeling, can serve as the value lower bound when the environment is deterministic. This is because, in a deterministic environment, the empirical return from any state can always be reproduced by replaying the same action sequence, and is therefore a lower bound on the optimal value of that state. We experiment on Atari games, FetchEnv tasks, and a challenging, physically simulated car push-and-reach task. In most cases, simply lower bounding the value target with the discounted episodic return performs at least as well as common baselines such as TD3, SAC, and Hindsight Experience Replay (HER). It learns much faster than TD3 or HER on some of the harder continuous-control tasks, while requiring minimal or no parameter tuning.
One-sentence Summary: Speed up reinforcement learning by lower bounding the Bellman value target with the empirical episodic return
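
A minimal sketch of the lower-bounded target described in the abstract, assuming a deterministic environment and a standard one-step Bellman backup. The helper names (compute_returns_to_go, lower_bounded_target) and the critic value used in the demo are illustrative assumptions, not the authors' code.

    def compute_returns_to_go(rewards, gamma):
        """Discounted return-to-go G_t = r_t + gamma * G_{t+1} for each step of one episode."""
        g = 0.0
        returns = []
        for r in reversed(rewards):
            g = r + gamma * g
            returns.append(g)
        return returns[::-1]

    def lower_bounded_target(reward, gamma, bootstrap_value, return_to_go, done):
        """One-step Bellman target, clipped from below by the observed return-to-go.

        In a deterministic environment the observed discounted return from a state can be
        reproduced by replaying the same actions, so it lower bounds the optimal value.
        """
        bellman_target = reward + gamma * (0.0 if done else bootstrap_value)
        return max(bellman_target, return_to_go)

    if __name__ == "__main__":
        rewards = [0.0, 0.0, 1.0]            # one recorded episode
        gamma = 0.99
        rtg = compute_returns_to_go(rewards, gamma)   # [0.9801, 0.99, 1.0]
        # Transition at t=0, with a pessimistic critic estimate of 0.5 for the next state:
        y = lower_bounded_target(rewards[0], gamma, bootstrap_value=0.5,
                                 return_to_go=rtg[0], done=False)
        print(y)  # max(0 + 0.99 * 0.5, 0.9801) = 0.9801

When the bootstrapped critic underestimates the value early in training, the empirical return-to-go takes over as the target, which is one way to read the speed-up reported in the abstract.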