Gradual Transition from Bellman Optimality Operator to Bellman Operator in Online Reinforcement Learning
Abstract: For continuous action spaces, actor-critic methods are widely used in online reinforcement learning (RL). However, unlike RL algorithms for discrete actions, which generally model the optimal value function using the Bellman optimality operator, RL algorithms for continuous actions typically model Q-values for the current policy using the Bellman operator. These algorithms for continuous actions rely exclusively on policy updates for improvement, which often results in low sample efficiency. This study examines the effectiveness of incorporating the Bellman optimality operator into actor-critic frameworks. Experiments in a simple environment show that modeling optimal values accelerates learning but leads to overestimation bias. To address this, we propose an annealing approach that gradually transitions from the Bellman optimality operator to the Bellman operator, thereby accelerating learning while mitigating bias. Our method, combined with TD3 and SAC, significantly outperforms existing approaches across various locomotion and manipulation tasks, demonstrating improved performance and robustness to hyperparameters related to optimality.
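As context for the abstract, the two operators can be written in their standard forms (these are textbook definitions, not reproduced from the paper): the Bellman operator for a policy $\pi$ and the Bellman optimality operator,

$(\mathcal{T}^{\pi} Q)(s,a) = r(s,a) + \gamma\, \mathbb{E}_{s' \sim P,\, a' \sim \pi}\!\left[ Q(s', a') \right]$

$(\mathcal{T}^{*} Q)(s,a) = r(s,a) + \gamma\, \mathbb{E}_{s' \sim P}\!\left[ \max_{a'} Q(s', a') \right]$

The proposed annealing gradually shifts the critic's target from the second form (faster improvement, but prone to overestimation) toward the first form (stable policy evaluation).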
Lay Summary: This study proposes a new approach to improve the learning efficiency of agents in environments with continuous action spaces, such as robotic control tasks. Traditional methods either accelerate learning at the risk of overestimating outcomes or ensure stability at the cost of slower progress. To balance this trade-off, we introduce a method that gradually transitions from a fast-learning strategy to a more stable one over time. This is achieved using a technique called expectile loss, which enables a smooth interpolation between the two learning strategies. The proposed method, Annealed Q-learning, integrates seamlessly with existing reinforcement learning algorithms. Experimental results across a range of control tasks demonstrate that agents trained with this method learn more effectively than those using standard approaches. Overall, our method promotes faster and more reliable learning in continuous action domains.
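A minimal sketch of how an expectile loss can interpolate between the two operators, assuming a simple linear annealing schedule; the function names, the starting expectile of 0.9, and the schedule are illustrative assumptions rather than the paper's exact hyperparameters (see the repository linked below for the actual implementation).

```python
import numpy as np

def expectile_loss(td_error, tau):
    """Asymmetric squared loss on the TD error.

    tau = 0.5 recovers the ordinary symmetric squared loss (Bellman operator target);
    tau -> 1 up-weights positive TD errors, approximating a max over actions
    (Bellman optimality operator target).
    """
    weight = np.where(td_error > 0.0, tau, 1.0 - tau)
    return weight * td_error ** 2

def annealed_tau(step, total_steps, tau_start=0.9, tau_end=0.5):
    """Hypothetical linear schedule: start near the optimality operator,
    end at ordinary policy evaluation (tau = 0.5)."""
    frac = min(step / total_steps, 1.0)
    return tau_start + frac * (tau_end - tau_start)
```

For example, early in training (step 0) the critic is trained with tau = 0.9, biasing updates toward optimistic, optimality-operator-like targets; by the end of training tau = 0.5, which corresponds to the standard Bellman operator used in TD3 and SAC.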
Link To Code: https://github.com/motokiomura/annealed-q-learning
Primary Area: Reinforcement Learning->Deep RL
Keywords: online reinforcement learning, q-learning, bellman operator
Submission Number: 9502