TL;DR: FloorplanQA evaluates large language models’ spatial and geometric reasoning on structured indoor layouts, with questions on topological logic and design constraints, and reveals gaps in models’ ability to reason about spatial environments.
Abstract: We introduce FloorplanQA, a diagnostic benchmark for evaluating spatial reasoning in large language models (LLMs). FloorplanQA is grounded in structured representations of indoor scenes (e.g., kitchens, living rooms, bedrooms, and bathrooms), encoded symbolically as JSON or XML layouts. The benchmark covers core spatial tasks, including distance measurement, visibility, path finding, and object placement within constrained spaces. Our results across a variety of frontier open-source and commercial LLMs reveal that while models may succeed on shallow queries, they often fail to respect physical constraints or to preserve spatial coherence, though they remain mostly robust to small spatial perturbations. FloorplanQA uncovers a blind spot in today’s LLMs: inconsistent reasoning about indoor layouts. We hope this benchmark inspires new work on language models that can accurately infer and manipulate spatial and geometric properties in practical settings.
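As a rough illustration of the kind of symbolic layout and spatial queries the abstract describes, the sketch below parses a small JSON room and answers a distance question and two constraint checks. The schema (`room`, `objects`, `x`, `y`, `w`, `d`), the example room, and the helper names are all hypothetical and chosen for illustration; they are not taken from the FloorplanQA data format or code.

```python
import json
import math

# Hypothetical layout: field names and values are illustrative assumptions,
# not the benchmark's actual schema. Units are meters; (x, y) is the
# lower-left corner of each axis-aligned rectangular footprint.
LAYOUT_JSON = """
{
  "room": {"type": "bedroom", "width": 4.0, "depth": 3.0},
  "objects": [
    {"name": "bed",      "x": 0.0, "y": 0.0, "w": 2.0, "d": 1.6},
    {"name": "desk",     "x": 3.0, "y": 0.0, "w": 1.0, "d": 0.6},
    {"name": "wardrobe", "x": 1.5, "y": 1.0, "w": 1.0, "d": 0.6}
  ]
}
"""


def center(obj):
    """Return the center point of an object's rectangular footprint."""
    return (obj["x"] + obj["w"] / 2, obj["y"] + obj["d"] / 2)


def center_distance(a, b):
    """Euclidean distance between the centers of two objects."""
    (ax, ay), (bx, by) = center(a), center(b)
    return math.hypot(ax - bx, ay - by)


def overlaps(a, b):
    """True if two axis-aligned footprints share interior area."""
    return (
        a["x"] < b["x"] + b["w"] and b["x"] < a["x"] + a["w"]
        and a["y"] < b["y"] + b["d"] and b["y"] < a["y"] + a["d"]
    )


def inside_room(obj, room):
    """True if the footprint lies fully within the room bounds."""
    return (
        obj["x"] >= 0 and obj["y"] >= 0
        and obj["x"] + obj["w"] <= room["width"]
        and obj["y"] + obj["d"] <= room["depth"]
    )


layout = json.loads(LAYOUT_JSON)
room = layout["room"]
objects = {o["name"]: o for o in layout["objects"]}

# Distance query between two objects.
bed_desk = center_distance(objects["bed"], objects["desk"])

# Physical-constraint checks: pairwise overlap and containment in the room.
bed_wardrobe_overlap = overlaps(objects["bed"], objects["wardrobe"])
all_inside = all(inside_room(o, room) for o in objects.values())

print(f"bed-desk center distance: {bed_desk:.2f} m")
print(f"bed overlaps wardrobe: {bed_wardrobe_overlap}")
print(f"all objects inside room: {all_inside}")
```

Checks of this kind give a deterministic ground truth against which a model's free-text answer could be compared, under the assumption that objects are axis-aligned rectangles.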
Lay Summary: Large language models are increasingly used for layout generation (e.g., for embodied AI applications) and indoor scene understanding, but it is still unclear how well they can reason about physical space and geometry. In this work, we introduce FloorplanQA, a benchmark for evaluating LLMs' spatial reasoning in the indoor floorplan setting. The benchmark tests whether models can plan navigation, place objects, adhere to spatial constraints, and maintain geometric relationships within 2D layouts.
To build FloorplanQA, we created a diverse set of floorplan reasoning tasks using both generated and real-world-inspired layouts. We then evaluated several recent multimodal and language models on these tasks. Our experiments show that current models often fail at basic spatial reasoning tasks. In particular, they struggle with navigation, overlap handling, and following geometric constraints.
We believe this benchmark can help researchers better understand the limitations of current AI systems and develop models that reason more reliably about physical environments. In the long term, progress in spatial reasoning may support applications in embodied AI, robotics, architectural design tools, and assistive planning systems.
Originally Submitted Supplementary Material: gz
Link To Code: https://github.com/OldDeLorean/FloorplanQA
Primary Area: General Machine Learning->Representation Learning
Keywords: Spatial Reasoning, Layout Reasoning, Scene Understanding, Structured Scene Representations, Benchmark, Large Language Models (LLMs)
Originally Submitted PDF: pdf
Submission Number: 7169