Spatial Reasoning Through Modality Switching Across Language, Vision, and Symbols

Published: 28 Apr 2026 · Last Modified: 28 Apr 2026 · MSLD 2026 Poster · CC BY 4.0
Keywords: spatial reasoning, multi-hop reasoning, modality switching, trustworthiness
TL;DR: We improve multi-hop spatial reasoning from text by converting story text into grid layouts the model can reason over, and by switching between text-only and grid-based grounding only when it helps.
Abstract: Human reasoning is inherently multimodal: when problems become difficult, we rarely think in words alone. We often sketch diagrams and draw grids to externalize the underlying conceptual structure, reason more reliably, and avoid mistakes. Recent benchmarks indicate that even strong LLMs struggle with multi-hop spatial reasoning in purely textual narratives, especially as reasoning depth increases while the scene geometry remains implicit. We therefore study whether large language models can reason more robustly about multi-hop textual-spatial stories when they convert the implicit spatial structure in language into an explicit geometric form, such as a grid or layout, rather than relying on natural language alone for inference. Based on this hypothesis, we introduce a grid-based visualization framework that maps the entities and spatial relations described in a story into an explicit 2D grid that preserves the scene geometry. In this framework, the model extracts relations, uses them to construct a grid, and performs multi-hop reasoning over the resulting layout. We further propose pruned grids that retain only the question-relevant entities and relations, producing compact layouts that reduce noise and improve performance. Across multiple benchmarks, grid-based visualization yields consistent improvements over text-only baselines with strong backbone models: up to 42% on StepGame (Llama 3.1-70B), 8.5% on SpaRTUN (GPT-5.1), and 8.5% on ReSQ (Llama 3.1-70B). We also improve the state of the art on StepGame by 2.3% and on ReSQ by 11%. These gains arise because the grid serves as an intermediate representation that the model explicitly reasons over during multi-hop inference.
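The grid-construction step described above (extract relations, then place entities on a 2D layout) can be sketched roughly as follows. This is an illustrative sketch only: the relation vocabulary, the triple format, and the one-cell placement rule are assumptions, not the paper's actual procedure.

```python
# Assumed relation vocabulary mapped to unit offsets on an integer grid.
# x grows rightward, y grows upward; these conventions are illustrative.
OFFSETS = {
    "left": (-1, 0), "right": (1, 0),
    "above": (0, 1), "below": (0, -1),
}

def build_grid(relations, anchor):
    """Place entities at integer 2D coordinates from (subject, relation, object)
    triples, starting from an anchor entity at the origin.

    Triples are applied in passes until no new entity can be placed, so the
    order of the input list does not matter for connected scenes.
    """
    pos = {anchor: (0, 0)}
    pending = list(relations)
    while pending:
        progressed = False
        for triple in list(pending):
            subj, rel, obj = triple
            dx, dy = OFFSETS[rel]
            if obj in pos and subj not in pos:
                # subject is offset from an already-placed object
                x, y = pos[obj]
                pos[subj] = (x + dx, y + dy)
            elif subj in pos and obj not in pos:
                # object sits in the opposite direction from the subject
                x, y = pos[subj]
                pos[obj] = (x - dx, y - dy)
            else:
                continue  # neither (or both) placed; revisit next pass
            pending.remove(triple)
            progressed = True
        if not progressed:
            break  # remaining relations are disconnected from the anchor
    return pos
```

For example, the story fragment "A is left of B; C is above B" anchored at B yields A at (-1, 0) and C at (0, 1); a "pruned grid" in the paper's sense would simply run this over the question-relevant triples only.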
To further improve efficiency and accuracy, we propose a switching mechanism that constructs explicit structures only when they are likely to benefit reasoning, invoking visualization selectively rather than indiscriminately. We introduce a switching metric that estimates when a model should remain in text-only reasoning and when it should invoke structured spatial grounding. The metric combines trustworthiness and complexity signals from each instance to enable principled modality selection rather than reliance on a fixed modality. This adaptive modality selection strategy yields an 11.4% improvement over a text-only baseline and a 2.8% improvement over a grid-only baseline with GPT-5.1.
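The shape of such a switching rule can be sketched as below. The concrete signals (a trustworthiness score and a relation count as a complexity proxy) and both thresholds are hypothetical stand-ins; the abstract does not specify how the paper's metric is computed.

```python
def should_use_grid(trust_score, n_relations,
                    trust_threshold=0.8, relation_threshold=3):
    """Decide whether to invoke grid-based grounding for one instance.

    trust_score:  assumed per-instance trustworthiness signal in [0, 1]
                  (e.g. model confidence or self-consistency agreement).
    n_relations:  assumed complexity signal (number of extracted spatial
                  relations, a proxy for reasoning depth).

    Stay text-only when the answer looks trustworthy and the scene is
    simple; switch to the explicit grid otherwise.
    """
    complex_scene = n_relations >= relation_threshold
    low_trust = trust_score < trust_threshold
    return complex_scene or low_trust
```

Under this sketch, a short story answered with high confidence stays in text-only mode, while either a deep multi-hop scene or a low-trust answer triggers grid construction, matching the selective rather than indiscriminate use of visualization described above.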
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 135