Abstract: Although the capabilities of Large Language Models and Large Reasoning Models have been increasingly tested on complex reasoning tasks, their long-horizon planning abilities have not yet been extensively investigated.
In this work, we provide a systematic assessment of the planning and long-horizon reasoning capabilities of state-of-the-art Large Reasoning Models (LRMs). We propose a novel benchmark based on Sokoban puzzles, intentionally simplified to isolate long-horizon planning from state persistence.
Our findings reveal a consistent degradation in planning performance when more than 25 moves are required to reach the solution, suggesting non-recoverable error accumulation under single-pass autoregressive decoding.
We show that equipping LRMs with Planning Domain Definition Language (PDDL) parsing, validation, and solving tools yields only modest improvements, suggesting that character-level counting errors and long but simple state tracking might not be overcome by test-time scaling approaches alone.
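As a concrete illustration of the kind of setup the abstract describes, here is a minimal sketch of a 1D Sokoban "corridor": the agent pushes a single box down a branch-free hallway, so the unique optimal solution is a long sequence of identical moves and a single execution slip is unrecoverable. The function names (`solve_corridor`, `run`) and the exact move encoding are our own illustrative assumptions, not the paper's benchmark code.

```python
# Illustrative sketch (not the paper's code): a minimal 1D Sokoban corridor.
# The agent can only move right ("R"); stepping into the box pushes it one
# cell ahead. The episode succeeds when the box lands on the goal cell.

def solve_corridor(agent: int, box: int, goal: int) -> list[str]:
    """Return the unique optimal move sequence, assuming agent < box <= goal."""
    # Walk up to the box, then push it the remaining distance.
    return ["R"] * (goal - agent - 1)

def run(agent: int, box: int, goal: int, moves: list[str]) -> bool:
    """Replay a move sequence and report whether the box ends on the goal."""
    for m in moves:
        if m != "R":
            return False      # only rightward moves exist in this toy setup
        agent += 1
        if agent == box:      # stepping into the box pushes it forward
            box += 1
    return box == goal

# A corridor needing 29 moves sits in the regime (>25 moves) where the
# paper reports degraded planning accuracy.
moves = solve_corridor(agent=0, box=1, goal=30)
assert len(moves) == 29
assert run(0, 1, 30, moves)
```

Because the solution is branch-free, verifying a model's output reduces to replaying the move string, which is what makes the task a clean probe of horizon length rather than search difficulty.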
Submission Type: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: In response to the reviews, we list below all changes and additions applied to the revised manuscript:
**1. New Models and Experiments Added**
* **Added `gemini-3-pro` and `grok-4.1-fast`:** Integrated these new models into the evaluation suite to confirm that horizon-dependent degradation persists across diverse architectures. Figures 3, 4, and 5 were updated to include these models.
* **New "Freetski" Task:** Introduced a new structured planning task based on a simplified, 1D version of the Klotski puzzle to test long-horizon state tracking without branching/multi-object complexities. Results for `gemini-3-pro` on this task were added.
* **Experiments on longer horizons:** Added to the appendix a study of corridor lengths from 100 to 200 for `gemini-3-pro`.
**2. Expanded Analysis and Failure Mode Taxonomy**
* **Systematic Failure Mode Analysis (Section 4.2):** Analyzed over 3,000 reasoning traces and categorized failures into six distinct modes (context overload, counting errors, map hallucinations, rule misinterpretations, logical inconsistencies, formatting errors) using an LLM-as-a-judge (`gpt-4.1`).
**3. Text Revisions, Clarifications, and Discussions**
* **Related Work:** Expanded the section to better position the corridor benchmark within the broader Sokoban literature, emphasizing its purpose (isolating horizon depth from branching). Added references to Hu et al. (2025a) and Wang et al. (2025).
* **Optimality Criterion (Section 3.4):** Clarified the design choice of evaluating only the unique optimal solution to prevent rewarding trial-and-error over structured forward reasoning.
* **Context Length vs. Planning Degradation (Section 4.1):** Clarified that performance degradation (especially seen in `grok-4.1-fast`) is tied to the accumulation of execution errors over long horizons, not just context window limits.
* **LLM-Modulo Setting (Section 4.3):** Revised the text to frame the LLM-modulo setting as an assisted upper-bound diagnostic configuration rather than a budget-matched comparative baseline, and included LLM-Modulo results for `gemini-3-pro`.
* **Conclusions / Limitations:** Acknowledged that training-data exposure cannot be fully ruled out. Added the exploration of infeasible or adversarial 1D layouts to the "Current limitations and Future works" section.
* **Minor typos and rephrasing:** Corrected typos and improved phrasing throughout the text.
**4. Appendix Additions**
* **Appendix E:** Added the prompt used for the LLM-as-a-judge failure mode classification.
* **Appendix F / H:** Added the full explanation, illustration, and results of the new "Freetski" problem.
* **Appendix G (Figure 10):** Added a figure showing `gemini-3-pro`'s performance drop-off on corridor lengths reaching 100.
Assigned Action Editor: ~Marc_Lanctot1
Submission Number: 7168