AlphaZero in Sparsely Rewarded Games: Limits and Auxiliary Supervision

TMLR Paper8782 Authors

06 May 2026 (modified: 26 May 2026) | Under review for TMLR | CC BY 4.0
Abstract: AlphaZero has demonstrated that neural-guided Monte Carlo Tree Search (MCTS) can achieve superhuman performance, but strong play does not necessarily imply perfect play. We study this gap in two oracle-evaluable domains with contrasting structure: Connect Four, a solved partisan game with exact game-theoretic values, and Chomp, an impartial game whose optimal play is governed by Grundy-number structure. Under a unified self-play plus MCTS pipeline, we compare vanilla AlphaZero, a multi-frame variant, and an AlphaZero Auxiliary Loss (AZAL) variant that adds oracle-derived policy supervision. We find that vanilla AlphaZero plays strongly in both domains but does not preserve the exact trajectories required for optimal play: in Connect Four, it fails to maintain the optimal line of play, while in Chomp, it fails to consistently restore the $g=0$ invariant, that is, to move to a position with Grundy value zero. Multi-frame inputs alone do not close this gap. In contrast, AZAL substantially improves optimality recovery, reaching perfect oracle consistency on the evaluated Chomp traces and near-perfect oracle consistency on the evaluated Connect Four trace. These results suggest that, in these oracle-evaluable settings, a major bottleneck is the weakness of the standard AlphaZero search-learning signal.
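To make the $g=0$ invariant concrete, the following is a minimal sketch of a brute-force Grundy oracle for small Chomp boards. It is not the authors' oracle: the state encoding (a tuple of non-increasing row lengths with the poisoned square at cell (0, 0)) and the function names `moves`, `apply_move`, `grundy`, and `zero_restoring_moves` are assumptions made for illustration, and the abstract does not specify board sizes or the exact consistency metric.

```python
from functools import lru_cache

# Assumed encoding: a Chomp position is a tuple of non-increasing row lengths.
# Cell (0, 0) is the poisoned square. Taking it loses, so it is excluded from
# the legal moves and the poison-only position (1,) has Grundy value 0.


def moves(state):
    """All legal (row, col) bites, excluding the poisoned square."""
    return [(r, c) for r, n in enumerate(state) for c in range(n) if (r, c) != (0, 0)]


def apply_move(state, move):
    """Remove every cell at or below row r and at or right of column c."""
    r, c = move
    rows = [n if i < r else min(n, c) for i, n in enumerate(state)]
    return tuple(n for n in rows if n > 0)


@lru_cache(maxsize=None)
def grundy(state):
    """Grundy value: minimum excludant over the successors' Grundy values."""
    reachable = {grundy(apply_move(state, m)) for m in moves(state)}
    g = 0
    while g in reachable:
        g += 1
    return g


def zero_restoring_moves(state):
    """Moves that hand the opponent a g = 0 position (empty if g is already 0)."""
    return [m for m in moves(state) if grundy(apply_move(state, m)) == 0]
```

Under this encoding, a move from a position with $g \neq 0$ restores the invariant exactly when it appears in `zero_restoring_moves(state)`. On the 2x2 board `(2, 2)`, for example, the only such move is `(1, 1)`, which bites off the corner opposite the poisoned square. Exhaustive recursion of this kind is practical only for small boards.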
Submission Type: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: Made supplementary material anonymous.
Assigned Action Editor: ~Julian_Zimmert1
Submission Number: 8782