AlphaMaze: Enhancing Spatial Intelligence in Large Language Models

ACL ARR 2025 May Submission 994 Authors

16 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · License: CC BY 4.0
Abstract: Although Large Language Models (LLMs) have demonstrated impressive capabilities in language processing, they often struggle with tasks requiring spatial reasoning, particularly in applications such as robot navigation, where understanding the robot's position relative to its environment is key. To investigate spatial reasoning in text-based reasoning models, we design MazeBench, a benchmark of 5x5 mazes of varying complexity rendered as text. On this benchmark, DeepSeek-R1-671B solves 74% of the mazes zero-shot. However, with Supervised Fine-Tuning (SFT), our 1.5B-parameter model AlphaMaze-SFT solves 87% of the mazes. Further refinement with Group Relative Policy Optimization (GRPO) allows AlphaMaze-GRPO to solve 95% of the benchmark. Our results demonstrate that while spatial reasoning can be achieved by a powerful general reasoning model, a smaller specialist model can also attain strong spatial reasoning capabilities, presenting a viable approach for resource-constrained applications such as robotics.
Paper Type: Short
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: vision language navigation, cross-modal pretraining, cross-modal application, multimodality
Contribution Types: NLP engineering experiment, Approaches to low-resource settings, Data resources
Languages Studied: English
Submission Number: 994