Plan Deeply or Estimate Precisely?: A Resource-Aware AlphaZero with Dynamic Quantile Allocation

Seunghee Lee; Dongjae Kim

20 Sept 2025 (modified: 17 Nov 2025) | ICLR 2026 Conference Withdrawn Submission | CC BY 4.0

Keywords: reinforcement learning, AlphaZero, distributional reinforcement learning, Monte Carlo Tree Search, planning

TL;DR: We built an AlphaZero that can decide how to spend its limited "thinking budget": either "think deeper" (more MCTS searches) or "think clearer" (get a more precise value estimate using distributional RL).

Abstract: AlphaZero integrates deep reinforcement learning (RL) with Monte Carlo Tree Search (MCTS) and has demonstrated remarkable performance in combinatorial games. MCTS enables deep planning by leveraging learned value estimates, but in vast state spaces these estimates require extensive sampling and often exhibit high uncertainty. While this can be mitigated with massive computational resources, such an approach is often impractical and presents two key challenges: a need for greater computational efficiency to achieve strong performance under realistic constraints, and the tendency for resource-constrained agents to develop strategies that deviate from human heuristics. In this work, we address both challenges by incorporating distributional RL into MCTS, replacing the scalar value estimate with a probability distribution via quantile regression. During its search, our agent dynamically increases the number of quantiles until the "action gap" (the difference between the best and second-best action values) exceeds a predefined confidence threshold. This mechanism enables the agent to autonomously trade off between deeper planning and lower value-estimation uncertainty within a fixed computational budget. We evaluated our proposed model on Four-in-a-Row, a game whose intermediate-sized state space is large enough to expose efficiency gains yet small enough to measure them precisely, and compared it with several AlphaZero variants. The model achieved higher performance while consuming fewer resources and developed effective policies with greater sample efficiency. Moreover, the model's behavioral patterns more closely resembled human heuristics than those of the other AlphaZero variants, suggesting that how an agent allocates its cognitive budget is crucial for emulating human-like heuristics.
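
The quantile-allocation rule described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`action_gap`, `allocate_quantiles`, `toy_estimator`), the doubling schedule, the cap `n_max`, and the use of the mean of the quantiles as the action value are all assumptions made for illustration. The abstract specifies only that the number of quantiles grows until the action gap exceeds a threshold.

```python
from typing import Callable, List, Sequence, Tuple


def action_gap(q_values: Sequence[float]) -> float:
    """Difference between the best and second-best action values."""
    if len(q_values) < 2:
        raise ValueError("need at least two actions to compute a gap")
    ranked = sorted(q_values, reverse=True)
    return ranked[0] - ranked[1]


def allocate_quantiles(
    estimate: Callable[[int], Sequence[Sequence[float]]],
    threshold: float,
    n_start: int = 1,
    n_max: int = 32,
) -> Tuple[int, List[float]]:
    """Grow the quantile count until the action gap exceeds `threshold`.

    `estimate(n)` is a hypothetical callback returning, for each action,
    a list of n quantile values. Doubling n and capping it at `n_max`
    are assumptions of this sketch, standing in for a compute budget.
    """
    n = n_start
    while True:
        per_action = estimate(n)
        # Assumed reduction: action value = mean of its quantile values.
        q_values = [sum(q) / len(q) for q in per_action]
        if action_gap(q_values) > threshold or n >= n_max:
            return n, q_values
        n = min(2 * n, n_max)


def toy_estimator(n: int) -> List[List[float]]:
    """Toy stand-in for a quantile network with two actions.

    Action 0 has inverse CDF tau**2, evaluated at quantile midpoints;
    action 1 always returns 0.2. Purely illustrative.
    """
    taus = [(2 * i + 1) / (2 * n) for i in range(n)]
    return [[t * t for t in taus], [0.2] * n]


if __name__ == "__main__":
    n_used, values = allocate_quantiles(toy_estimator, threshold=0.12)
    print(n_used, values)
```

In this toy setting a coarse estimate leaves the two actions too close to separate, so the loop spends more quantiles before committing. An agent with a fixed budget would weigh that cost against running additional MCTS simulations, which the sketch does not model.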

Supplementary Material: pdf

Primary Area: reinforcement learning

Submission Number: 24431
