Keywords: large language models, reasoning, compositional generalization
TL;DR: Atomic chess exposes that reasoning LLMs can state and locally apply altered rules, but still often fail to integrate them into globally correct action selection when those rules conflict with standard priors.
Abstract: Large language models can play chess at a notable level, but it is unclear whether this reflects reasoning or memorized priors. We study whether models can compose an explicit rule change with their existing chess knowledge, using atomic chess, a variant that preserves the board and piece notation while altering the rules: every capture explodes adjacent pieces, turning many standard tactics into mistakes. We construct a paired diagnostic evaluation of $200$ variant-divergent positions drawn from standard and atomic chess games. Every position is selected so that the best move under atomic rules differs from the best move under standard rules, and the standard-best move is a blunder under atomic rules. Across Claude Opus 4.6 and GPT-5.4, Win\% loss under atomic rules exceeds Win\% loss under standard rules by up to $4.6\times$ on identical positions. Increasing reasoning compute reduces mean atomic Win\% loss but does not eliminate the gap. Qualitative analysis of reasoning traces reveals failures in board representation and in rule composition, most notably \emph{unpropagated refutation}, where a model recognizes that a candidate move is bad under atomic rules but selects it anyway. These results suggest that reasoning LLMs can partially use explicit compositional rules, but sometimes fail to override familiar action priors.
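The selection criterion and the Win\% loss metric described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the `Position` container, the function names, the 20-point blunder threshold, and the example numbers are all assumptions made for illustration, and the per-move Win\% values are taken as already computed by some engine under each rule set.

```python
from dataclasses import dataclass
from typing import Dict

# Assumed cutoff: a move counts as a blunder if it gives up at least this
# many Win% points relative to the best move. The abstract does not state
# the threshold actually used.
BLUNDER_THRESHOLD = 20.0


@dataclass(frozen=True)
class Position:
    """Hypothetical container: Win% per candidate move under each rule set."""
    standard_win_pct: Dict[str, float]
    atomic_win_pct: Dict[str, float]


def best_move(win_pct: Dict[str, float]) -> str:
    """Return the move with the highest Win%."""
    return max(win_pct, key=win_pct.get)


def win_pct_loss(win_pct: Dict[str, float], move: str) -> float:
    """Win% given up by playing `move` instead of the best available move."""
    return max(win_pct.values()) - win_pct[move]


def is_variant_divergent(pos: Position,
                         threshold: float = BLUNDER_THRESHOLD) -> bool:
    """True if the best moves differ across rule sets and the standard-best
    move is a blunder when evaluated under atomic rules."""
    std_best = best_move(pos.standard_win_pct)
    atomic_best = best_move(pos.atomic_win_pct)
    if std_best == atomic_best:
        return False
    if std_best not in pos.atomic_win_pct:
        # The standard-best move has no atomic evaluation; skip the position.
        return False
    return win_pct_loss(pos.atomic_win_pct, std_best) >= threshold


# Made-up numbers: a capture that is strong in standard chess but loses
# badly once the explosion rule applies.
example = Position(
    standard_win_pct={"Nxe5": 70.0, "d3": 55.0},
    atomic_win_pct={"Nxe5": 10.0, "d3": 60.0},
)
```

Under these assumptions, `is_variant_divergent(example)` holds because the standard-best capture differs from the atomic-best move and gives up more than the threshold under atomic rules. A model's loss on a position would be `win_pct_loss` evaluated on the move it actually chose, under the rule set being tested.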
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 168