Keywords: AlphaZero, Monte Carlo Tree Search, self-play reinforcement learning, game-theoretic learning, oracle evaluation, perfect play
TL;DR: AlphaZero can play superhumanly without being perfect. In Connect Four and Chomp, it often misses exact optimal lines, and richer inputs do not fix this. AZAL closes the gap, reaching perfect or near-perfect play through stronger supervision.
Abstract: AlphaZero has demonstrated that a neural-guided Monte Carlo Tree Search can achieve superhuman performance, but strong play
does not necessarily imply perfect play. We study this gap in two
oracle-evaluable domains with contrasting structure: Connect Four, a solved
partisan game with exact game-theoretic values, and Chomp, an impartial game
whose optimal play is governed by Grundy-number structure. Under a unified
self-play $+$ MCTS pipeline, we compare vanilla AlphaZero, a multi-frame variant, and an AlphaZero Auxiliary Loss (AZAL) variant that adds
oracle-derived policy supervision. We find that vanilla AlphaZero achieves
strong play across both domains but cannot preserve the exact
trajectories required for optimal play: in Connect Four, it fails to maintain
the optimal line of play, while in Chomp, it fails to consistently restore the
$g=0$ invariant. Multi-frame inputs alone do not remove this gap. In contrast, AZAL
substantially improves optimality recovery, reaching perfect play in Chomp and
near-perfect play in Connect Four. These results suggest that the main
bottleneck lies less in the input representation than in the weakness of the standard AlphaZero search-learning signal.
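To make the $g=0$ invariant concrete, here is a minimal sketch of a brute-force Grundy oracle for small Chomp positions. It is an illustration only, not the oracle used in the submission. The board encoding (a non-increasing tuple of row lengths with the poisoned square at `(0, 0)`), the convention that taking the poisoned square is not a legal move, and the function names `moves`, `grundy`, and `restoring_moves` are all assumptions made for this example.

```python
from functools import lru_cache


def moves(rows):
    """Yield (move, successor) pairs for a Chomp position.

    Assumed encoding: rows is a non-increasing tuple of row lengths,
    and square (0, 0) is the poisoned square.
    """
    for r, length in enumerate(rows):
        for c in range(length):
            if (r, c) == (0, 0):
                continue  # assumption: the poisoned square is never a legal move
            # Taking (r, c) removes every square in rows >= r and columns >= c.
            clipped = tuple(min(n, c) if i >= r else n for i, n in enumerate(rows))
            yield (r, c), tuple(n for n in clipped if n > 0)


@lru_cache(maxsize=None)
def grundy(rows):
    """Grundy number: the minimum excludant (mex) over successor values."""
    reachable = {grundy(nxt) for _, nxt in moves(rows)}
    g = 0
    while g in reachable:
        g += 1
    return g


def restoring_moves(rows):
    """Moves that hand the opponent a g = 0 position."""
    return [mv for mv, nxt in moves(rows) if grundy(nxt) == 0]


if __name__ == "__main__":
    board = (2, 2)  # full 2x2 board
    print(grundy(board), restoring_moves(board))
```

Under these conventions, a position with Grundy number 0 is lost for the player to move, so optimal play from a winning position means always choosing a move from `restoring_moves`. An agent that plays a legal but non-restoring move even once gives the opponent a winning position, which is the kind of deviation the abstract describes for vanilla AlphaZero.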
Submission Number: 2