A Deep Proactive Exploration Policy Based on Asymptotic Statistics for Asynchronous Q-Learning

Published: 28 Nov 2025, Last Modified: 30 Nov 2025, NeurIPS 2025 Workshop MLxOR, CC BY 4.0
Keywords: Q-learning, exploration policy, central limit theorem, deep learning
TL;DR: Deep proactive learning for Q-learning exploration policy
Abstract: This paper presents a new methodology that adaptively optimizes the exploration policy in reinforcement learning. We shift the objective from value-function accuracy to direct policy identification, using the probability of correct selection as the metric. We establish a new central limit theorem for asynchronous Q-learning with adaptive step sizes and propose a regularized signal-to-noise ratio index for exploration policy design. To address the computational cost of the resulting high-dimensional optimization, we propose a novel pipeline with an offline, simulation-based proactive learning loop. This loop trains a deep neural network to serve as a fast, low-dimensional proxy for the complex optimization problem, allowing for efficient online policy updates in real-world applications. We validate our approach on the challenging RiverSwim environment, demonstrating superior performance compared to standard exploration heuristics.
Submission Number: 125
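
The abstract does not specify implementation details. Purely as an illustrative, non-authoritative sketch of the setting it describes, the snippet below runs tabular asynchronous Q-learning with an adaptive (polynomial) step size on a simplified RiverSwim-style chain MDP; the `exploration_index` function is a hypothetical count-based stand-in, not the paper's regularized signal-to-noise-ratio index or its deep-network proxy.

```python
# Illustrative sketch only: asynchronous tabular Q-learning on a simplified
# RiverSwim-style chain. The exploration index below is a hypothetical
# value-plus-uncertainty score, standing in for the paper's regularized
# SNR criterion; the dynamics are a simplified version of the benchmark.
import numpy as np

N_STATES, LEFT, RIGHT = 6, 0, 1
rng = np.random.default_rng(0)

def step(s, a):
    """Simplified RiverSwim dynamics: RIGHT moves upstream only with some
    probability, LEFT always moves downstream. Small reward near state 0,
    large reward at the last state."""
    if a == LEFT:
        s_next = max(s - 1, 0)
    else:  # RIGHT succeeds with prob 0.6, otherwise slips back
        s_next = min(s + 1, N_STATES - 1) if rng.random() < 0.6 else max(s - 1, 0)
    r = 1.0 if s_next == N_STATES - 1 else (0.005 if s_next == 0 else 0.0)
    return s_next, r

def exploration_index(Q, counts, s):
    """Hypothetical index: value estimate plus a count-based uncertainty bonus."""
    return Q[s] + 1.0 / np.sqrt(counts[s] + 1.0)

Q = np.zeros((N_STATES, 2))
counts = np.zeros((N_STATES, 2))
gamma, s = 0.95, 0

for t in range(50_000):
    a = int(np.argmax(exploration_index(Q, counts, s)))
    s_next, r = step(s, a)
    counts[s, a] += 1
    alpha = 1.0 / counts[s, a] ** 0.7          # adaptive polynomial step size
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])   # asynchronous update of one (s, a) entry
    s = s_next

print("Greedy policy (0=LEFT, 1=RIGHT):", Q.argmax(axis=1))
```

With this count-based bonus the greedy policy typically learns to swim RIGHT toward the high-reward end of the chain, the behavior that purely epsilon-greedy exploration on RiverSwim is known to find slowly.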