Use the Online Network If You Can: Towards Fast and Stable Reinforcement Learning

Published: 26 Jan 2026 · Last Modified: 13 May 2026 · ICLR 2026 Poster · CC BY 4.0
Keywords: deep reinforcement learning, q-learning, actor-critic, function approximation
TL;DR: MINTO is a simple yet effective target bootstrapping method for temporal-difference RL that enables faster, more stable learning and consistently improves performance across algorithms and benchmarks.
Abstract: The use of target networks is a popular approach for estimating value functions in deep Reinforcement Learning (RL). While effective, the target network remains a compromise that preserves stability at the cost of slowly moving targets, thus delaying learning. Conversely, using the online network as a bootstrapped target is intuitively appealing, albeit well known to lead to unstable learning. In this work, we aim to get the best of both worlds by introducing a novel update rule that computes the target using the **MIN**imum estimate between the **T**arget and **O**nline network, giving rise to our method, **MINTO**. Through this simple yet effective modification, we show that MINTO enables faster and more stable value function learning by mitigating the potential overestimation bias of using the online network for bootstrapping. Notably, MINTO can be seamlessly integrated into a wide range of value-based and actor-critic algorithms at negligible cost. We evaluate MINTO extensively across diverse benchmarks, spanning online and offline RL, as well as discrete and continuous action spaces. Across all benchmarks, MINTO consistently improves performance, demonstrating its broad applicability and effectiveness.
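The abstract describes the update rule only at a high level, so the sketch below is one plausible instantiation for discrete-action Q-learning, not the paper's exact formulation: the TD target bootstraps from the element-wise minimum of the online and target network estimates at the next state. All function and argument names here are hypothetical.

```python
import torch

def minto_td_target(online_net, target_net, reward, next_obs, done, gamma=0.99):
    """Illustrative MINTO-style TD target for discrete-action Q-learning.

    Assumption: the minimum is taken per action between the online and
    target network estimates before the greedy max over actions; the
    paper may define the rule differently.
    """
    with torch.no_grad():
        q_online = online_net(next_obs)      # (batch, n_actions)
        q_target = target_net(next_obs)      # (batch, n_actions)
        # MINTO: bootstrap from the minimum of the two estimates,
        # curbing the overestimation risk of pure online bootstrapping
        q_min = torch.minimum(q_online, q_target)
        next_value = q_min.max(dim=1).values  # greedy bootstrap value
        return reward + gamma * (1.0 - done) * next_value
```

Because the rule only changes how the bootstrap target is computed, it drops into existing value-based or actor-critic training loops in place of the usual target-network-only bootstrap, at the cost of one extra forward pass through the online network.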
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 24658