Abstract: AlphaZero has demonstrated that neural-guided Monte Carlo Tree Search can
achieve superhuman performance, but strong play
does not necessarily imply perfect play. We study this gap in two
oracle-evaluable domains with contrasting structure: Connect Four, a solved
partisan game with exact game-theoretic values, and Chomp, an impartial game
whose optimal play is governed by Grundy-number structure. Under a unified
self-play $+$ MCTS pipeline, we compare vanilla AlphaZero, a multi-frame variant, and an AlphaZero Auxiliary Loss (AZAL) variant that adds
oracle-derived policy supervision. We find that vanilla AlphaZero achieves
strong play across both domains but cannot preserve the exact
trajectories required for optimal play: in Connect Four, it fails to maintain
the optimal line of play, while in Chomp, it fails to consistently restore the
$g=0$ invariant. Multi-frame inputs alone do not remove this gap. In contrast, AZAL
substantially improves optimality recovery, reaching perfect oracle consistency on the evaluated Chomp traces and near-perfect
oracle consistency on the evaluated Connect Four trace. These results suggest that, in these oracle-evaluable settings, a major bottleneck is the weakness of the standard AlphaZero search-learning signal.
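To make the $g=0$ invariant in Chomp concrete, the following is a minimal illustrative sketch and not the oracle used in this work. It assumes a board encoded as a non-increasing tuple of row lengths with the poisoned square at cell (0, 0), and it treats taking the poisoned square as illegal so that the normal-play convention applies. The names `moves`, `grundy`, and `restoring_moves` are hypothetical. A move "restores the invariant" when it leaves the opponent in a position with Grundy value 0.

```python
from functools import lru_cache


def moves(rows):
    """Yield successor positions of a Chomp board.

    `rows` is a non-increasing tuple of row lengths (an assumed encoding);
    cell (0, 0) is the poisoned square.
    """
    for i in range(len(rows)):
        for j in range(rows[i]):
            if i == 0 and j == 0:
                continue  # taking the poisoned square loses, so treat it as illegal
            # Choosing cell (i, j) removes every cell (i', j') with i' >= i and j' >= j.
            new = tuple(r if k < i else min(r, j) for k, r in enumerate(rows))
            yield tuple(r for r in new if r > 0)


@lru_cache(maxsize=None)
def grundy(rows):
    """Grundy value of a position: the mex of the successors' Grundy values."""
    seen = {grundy(s) for s in moves(rows)}
    g = 0
    while g in seen:
        g += 1
    return g


def restoring_moves(rows):
    """Successor positions with g = 0, i.e. the moves an optimal player may choose."""
    return [s for s in moves(rows) if grundy(s) == 0]
```

Under these assumptions, `restoring_moves((2, 2))` returns `[(2, 1)]`: on a 2x2 board the only optimal move takes the corner opposite the poisoned square. A position with `grundy(rows) == 0` has no restoring move, so the player to move loses against optimal play.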
Submission Type: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: Made supplementary material anonymous.
Assigned Action Editor: ~Julian_Zimmert1
Submission Number: 8782