Keywords: Spatial Reasoning, Benchmark, Maze, Memory, Reasoning Depth
Abstract: Existing benchmarks for evaluating reasoning in large language models primarily emphasize final-answer correctness, making it difficult to distinguish genuine multi-step reasoning from statistical shortcuts and the exploitation of prior knowledge. Accurately measuring reasoning depth requires environments that eliminate this confounding; yet current benchmarks suffer from data contamination and static test-distribution bias, allowing prior knowledge to masquerade as reasoning and preventing clean isolation of capability. To overcome this, we introduce DeepMaze, a minimalist benchmark of procedurally generated environments with rigorously controlled topology. Its dual-task architecture, comprising planning under full observability and exploration under partial observability, requires models to dynamically track environmental state across sequential actions and therefore necessitates sustained, state-consistent reasoning. Within this environment, we define a reasoning depth metric that quantifies the length of state-consistent action sequences, explicitly decoupling process quality from outcome success. This design isolates LLMs' core reasoning capabilities under controlled conditions, establishing a foundation for evaluating their true multi-step reasoning proficiency independent of domain-specific knowledge or outcome-driven shortcuts.
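To make the reasoning depth metric concrete, here is a minimal sketch of how such a score could be computed: the depth of an action sequence is the length of its longest prefix in which every move remains legal given the maze topology. The function name, the grid encoding, and the action vocabulary are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a reasoning-depth score: count consecutive
# state-consistent moves before the first violation (wall hit, off-grid
# step, or unknown action). Encoding: '#' = wall, '.' = open cell.

MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def reasoning_depth(maze, start, actions):
    """Return the length of the longest state-consistent action prefix.

    maze: list of equal-length strings describing the grid.
    start: (row, col) starting position.
    actions: sequence of action names drawn from MOVES.
    """
    rows, cols = len(maze), len(maze[0])
    r, c = start
    depth = 0
    for action in actions:
        dr, dc = MOVES.get(action, (None, None))
        if dr is None:
            break  # unknown action: not state-consistent
        nr, nc = r + dr, c + dc
        # Stepping into a wall or off the grid breaks state consistency.
        if not (0 <= nr < rows and 0 <= nc < cols) or maze[nr][nc] == "#":
            break
        r, c = nr, nc
        depth += 1
    return depth

if __name__ == "__main__":
    maze = ["....",
            ".##.",
            "...."]
    # The third action walks into a wall, so the depth is 2, not 4,
    # even though a final answer might still reach the goal by luck.
    print(reasoning_depth(maze, (0, 0), ["right", "right", "down", "left"]))
```

This illustrates the decoupling the abstract describes: the score rewards how long a model's action trace stays consistent with the environment's state, independent of whether the episode ultimately succeeds.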
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, evaluation methodologies, evaluation, metrics
Contribution Types: Data resources, Data analysis
Languages Studied: English
Submission Number: 10790