AlphaZero in Sparsely Rewarded Games: Limits and Auxiliary Supervision

Published: 07 Jun 2026, Last Modified: 07 Jun 2026, Venue: ICML 2026 Workshop, License: CC BY 4.0
Keywords: AlphaZero, Monte Carlo Tree Search, self-play reinforcement learning, game-theoretic learning, oracle evaluation, perfect play
TL;DR: AlphaZero can play superhumanly without being perfect. In Connect Four and Chomp, it often misses exact optimal lines, and richer inputs do not fix this. AZAL closes the gap, reaching perfect or near-perfect play through stronger supervision.
Abstract: AlphaZero has demonstrated that neural-guided Monte Carlo Tree Search (MCTS) can achieve superhuman performance, but strong play does not necessarily imply perfect play. We study this gap in two oracle-evaluable domains with contrasting structure: Connect Four, a solved partisan game with exact game-theoretic values, and Chomp, an impartial game whose optimal play is governed by Grundy-number structure. Under a unified self-play $+$ MCTS pipeline, we compare vanilla AlphaZero, a multi-frame variant, and an AlphaZero Auxiliary Loss (AZAL) variant that adds oracle-derived policy supervision. We find that vanilla AlphaZero achieves strong play in both domains but cannot preserve the exact trajectories required for optimal play: in Connect Four, it fails to maintain the optimal line of play, while in Chomp, it fails to consistently restore the $g=0$ invariant. Multi-frame inputs alone do not remove this gap. In contrast, AZAL substantially improves optimality recovery, reaching perfect play in Chomp and near-perfect play in Connect Four. These results suggest that the main bottleneck lies less in the input representation than in the weakness of the standard AlphaZero search-learning signal.
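To make the $g=0$ invariant concrete, here is a minimal sketch of how Grundy values for small Chomp positions could be computed by brute force. It is an illustration only, not the oracle used in the paper. It assumes a board stored as a tuple of row lengths, a poisoned square at row 0, column 0, and the normal-play reformulation in which the poisoned square can never be bitten, so the player left with no legal bite loses. The names `moves`, `grundy`, and `restoring_moves` are hypothetical.

```python
from functools import lru_cache


def moves(rows):
    """Yield every position reachable in one bite.

    rows[i] is the number of squares left in row i (assumed non-increasing).
    Biting square (r, c) removes every square at row >= r and column >= c.
    """
    for r, length in enumerate(rows):
        for c in range(length):
            if (r, c) == (0, 0):
                continue  # assumption: the poisoned square is never a legal bite
            yield tuple(min(n, c) if i >= r else n for i, n in enumerate(rows))


@lru_cache(maxsize=None)
def grundy(rows):
    """Grundy value: the minimum excludant (mex) over successor Grundy values."""
    seen = {grundy(s) for s in moves(rows)}
    g = 0
    while g in seen:
        g += 1
    return g


def restoring_moves(rows):
    """Successor positions with g = 0, i.e. the bites that restore the invariant."""
    return [s for s in moves(rows) if grundy(s) == 0]
```

Under these assumptions, a position with $g \neq 0$ is a win for the player to move, and optimal play means always moving to a successor with $g=0$. For example, `restoring_moves((2, 2))` lists the bites on a 2×2 board that leave the opponent in a $g=0$ position. Exhaustive enumeration like this is only practical for small boards.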
Submission Number: 2