Track: tiny / short paper (up to 4 pages)
Keywords: reasoning benchmark, algorithmic reasoning
Abstract: Large language models often appear strong on symbolic and algorithmic tasks, yet this apparent strength can hide brittle behaviour when problems become longer, harder, or slightly out of distribution. A major limitation of
current reasoning benchmarks is that many primarily test whether a model
can produce a valid answer, while paying little attention to whether the
solution is minimal, robust, and stable under controlled difficulty scaling.
We introduce RecurrReason, a difficulty-controlled benchmark of
four recurrent logic puzzles (Tower of Hanoi, River Crossing, Block World,
and Checkers Jumping) with BFS-verified expert trajectories and a single
interpretable difficulty parameter $N \in \{1,\dots,10\}$, totalling
10,817 unique puzzles and 280,106 moves.
We benchmark two Transformer families, an encoder-decoder model
(T5-style) and a decoder-only model (GPT-2-style), under consistent data
splits and evaluation criteria, training on $N{=}1$ to $7$ and evaluating on
both held-out in-distribution instances and harder out-of-distribution
instances at $N{=}8$ to $10$.
A fine-tuned, pre-trained T5 achieves 97.27\% validation accuracy and 81.00\% OOD
accuracy on Block World, while all models score 0.00\% on River Crossing under all conditions.
Failure mode analysis reveals that architecture is a stronger determinant
of success than scale; pre-training transfers only to puzzles with locally
structured transition functions. Our code and dataset will be open-sourced upon acceptance.
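The abstract describes the expert trajectories as BFS-verified. A minimal sketch of how such shortest-trajectory verification might work, using Tower of Hanoi as the example puzzle (the state encoding and function names below are illustrative assumptions, not the authors' implementation):

```python
from collections import deque

def hanoi_successors(state):
    """Legal successor states. A state is a tuple of 3 pegs,
    each a tuple of disk sizes with the top disk last."""
    out = []
    for src in range(3):
        if not state[src]:
            continue
        disk = state[src][-1]
        for dst in range(3):
            # A disk may move onto an empty peg or a larger disk.
            if dst != src and (not state[dst] or state[dst][-1] > disk):
                pegs = [list(p) for p in state]
                pegs[src].pop()
                pegs[dst].append(disk)
                out.append(tuple(tuple(p) for p in pegs))
    return out

def bfs_shortest_length(start, goal, successors):
    """Length of a shortest move sequence from start to goal,
    found by breadth-first search over the state graph."""
    if start == goal:
        return 0
    dist = {start: 0}
    queue = deque([start])
    while queue:
        s = queue.popleft()
        for t in successors(s):
            if t not in dist:
                dist[t] = dist[s] + 1
                if t == goal:
                    return dist[t]
                queue.append(t)
    raise ValueError("goal unreachable")

# N=3 Tower of Hanoi: the optimal solution has 2**3 - 1 = 7 moves.
start = ((3, 2, 1), (), ())
goal = ((), (), (3, 2, 1))
print(bfs_shortest_length(start, goal, hanoi_successors))  # → 7
```

A candidate expert trajectory can then be checked by confirming that every move is legal and that its length matches the BFS distance from start to goal.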
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 179