Keywords: multimodal large language models, spatial reasoning, visual planning, maze solving, benchmark, test-time compute, token efficiency
TL;DR: High maze-solving scores can be misleading: frontier multimodal models solve visual mazes by translating images into text grids and brute-force searching in token space rather than planning visually.
Abstract: How do multimodal models solve visual spatial tasks---through genuine planning, or by brute-forcing solutions in token space? We introduce MazeBench, a benchmark of 110 procedurally generated maze images organized into nine controlled groups (diagnostic, grid scale, wall density, trap ablation, unreachable detection, and more), and evaluate 16 model configurations across four providers (OpenAI, Anthropic, Google, Alibaba) at multiple reasoning effort levels. GPT-5.4 solves 91% and Gemini 3.1 Pro 79%, but our analysis reveals these scores are misleading: models translate images into text grids and brute-force paths via serial enumeration, consuming 1,710--22,818 tokens per solve for a task humans do in seconds. Without added reasoning budgets, all configurations score only 2--12%; on 20x20 ultra-hard mazes, they hit token limits and give up. Qualitative analysis of model outputs confirms a universal two-stage strategy: image-to-grid translation followed by step-by-step path search in natural language---essentially BFS implemented in prose. A text-grid ablation shows Claude's poor image performance (6%) jumps to 80% when given the correct grid directly, confirming that vision quality, not reasoning ability, is the bottleneck for weaker models. Perhaps most strikingly, when we explicitly instruct models not to build a text grid and not to perform graph search---asking them to "reason visually, like a human"---they silently ignore the instruction and immediately fall back to the same grid-enumeration strategy. This suggests that brute-force token-level search is the dominant mechanism these models rely on for spatial planning in our setting.
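The "BFS implemented in prose" strategy the abstract describes can be made concrete with a short sketch: breadth-first search over a text-grid maze, which is what the models effectively re-derive step by step in natural language. The grid encoding (`#` for walls, `S`/`G` for start and goal) and the helper name `bfs_path` are illustrative assumptions, not artifacts from the paper.

```python
from collections import deque

def bfs_path(grid, start, goal):
    """Shortest path on a text grid ('#' = wall) via breadth-first search.

    Illustrative sketch of the grid-enumeration strategy described in the
    abstract; the models perform this search serially in prose, not in code.
    """
    rows, cols = len(grid), len(grid[0])
    frontier = deque([start])
    parent = {start: None}  # visited set doubling as a backpointer map
    while frontier:
        cell = frontier.popleft()
        if cell == goal:
            # Reconstruct the path by following backpointers to the start.
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] != '#' and (nr, nc) not in parent):
                parent[(nr, nc)] = cell
                frontier.append((nr, nc))
    return None  # goal unreachable (cf. the unreachable-detection group)

maze = [
    "S..#",
    ".#.#",
    "...G",
]
print(bfs_path(maze, (0, 0), (2, 3)))
```

A few tokens of Python suffice where the models spend thousands of reasoning tokens, which is the token-efficiency gap the benchmark quantifies.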
Submission Number: 22