SATURN: Symbolic Spatial Reasoning from 3D Scene Structure for Vision-Language Models

Published: 28 Apr 2026, Last Modified: 28 Apr 2026 | MSLD 2026 Poster | CC BY 4.0
Keywords: Neuro-Symbolic, Compositional Reasoning, Vision and Language
Abstract: **Problem.** Vision-Language Models (VLMs) perform strongly on many multimodal tasks, yet they remain brittle on spatial reasoning that depends on explicit 3D structure, reference frames, and compositional constraint satisfaction. Recent work improves spatial reasoning either by training VLMs with large-scale spatial data or by augmenting them with external 3D tools and prompting strategies. However, trained approaches encode spatial competence implicitly in model weights, while training-free approaches often still rely on the VLM to plan or aggregate spatial reasoning steps. We investigate compositional 3D spatial reasoning through an explicit symbolic layer operating over approximate scene geometry reconstructed from a 2D image.

**Approach.** We present **SATURN**, a training-free neuro-symbolic framework that externalizes spatial reasoning from the VLM into a geometric symbolic engine. SATURN decomposes visual reasoning into two stages: neural perception and symbolic spatial computation. Candidate objects are grounded using open-vocabulary detection and segmentation, while a VLM estimates atomic object attributes and concept scores. The system then reconstructs an approximate 3D object-centric scene proxy using monocular depth back-projection and object orientation estimation. From this representation, the spatial engine computes soft geometric predicates such as *left*, *above*, and *closer*, which are composed through symbolic logical operators to support multi-step compositional reasoning under perceptual uncertainty. Unlike prompting-based tool pipelines, SATURN represents spatial relations as reusable geometric predicates that can be explicitly composed, enabling interpretable multi-step reasoning without any task-specific fine-tuning.

**Results.** We conduct experiments on *CV-Bench-3D* and a novel compositional spatial reasoning benchmark called *SpatialPuzzles*, which is designed to evaluate multi-hop compositional spatial reasoning: each query requires identifying sets of objects satisfying multiple attribute constraints and chained spatial relations across several objects. For example, a query may ask: "Find four objects such that object 1 is a green object, object 2 is a sedan, object 3 is right of object 4 from the camera's perspective, and object 1 is right of object 2 from object 2's own perspective", which demands simultaneous constraint satisfaction across both camera-centric and object-centric reference frames. The dataset is motivated by a limitation of existing 3D spatial reasoning benchmarks: their reasoning chains span only a few hops. SATURN improves spatial reasoning reliability compared to direct VLM inference. On *CV-Bench-3D*, it improves depth and distance prediction relative to strong neural baselines, reaching **90.5%** accuracy on depth and **84.0%** on distance. On *SpatialPuzzles*, it achieves substantial gains in compositional reasoning accuracy over the backbone model alone, improving from 51.9% to **84.6%**. These results suggest that explicitly computing spatial relations from geometry and composing them symbolically helps mitigate the compositional failures observed in purely neural VLMs.
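To make the predicate-and-composition idea concrete, below is a minimal sketch of how soft geometric predicates over a back-projected scene proxy could be computed and composed. It is illustrative only: the sigmoid scoring, the product t-norm for conjunction, the temperature `tau`, and all function names and camera intrinsics are assumptions for exposition, not SATURN's actual implementation.

```python
import numpy as np

def back_project(u, v, z, fx, fy, cx, cy):
    """Back-project pixel (u, v) with monocular depth z into camera-frame
    3D coordinates (x right, y down, z into the scene), pinhole model."""
    return np.array([(u - cx) * z / fx, (v - cy) * z / fy, z])

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Soft geometric predicates: scores in (0, 1) rather than hard booleans,
# so perceptual noise degrades scores gracefully instead of flipping them.
def left_of(a, b, tau=0.2):
    """Soft 'a is left of b' from the camera: sigmoid of the signed x-gap."""
    return sigmoid((b[0] - a[0]) / tau)

def above(a, b, tau=0.2):
    """Soft 'a is above b' (image y grows downward, so b_y - a_y > 0)."""
    return sigmoid((b[1] - a[1]) / tau)

def closer(a, b, tau=0.2):
    """Soft 'a is closer to the camera than b', compared on depth z."""
    return sigmoid((b[2] - a[2]) / tau)

# Symbolic composition: product t-norm for AND, max for OR.
def AND(*scores):
    return float(np.prod(scores))

def OR(*scores):
    return float(max(scores))

if __name__ == "__main__":
    fx = fy = 500.0          # assumed focal lengths (pixels)
    cx, cy = 320.0, 240.0    # assumed principal point
    # Two detected objects: (pixel_u, pixel_v, estimated depth in meters).
    car = back_project(150, 260, 4.0, fx, fy, cx, cy)
    tree = back_project(480, 180, 7.5, fx, fy, cx, cy)
    # Compositional query: "the car is left of the tree AND closer than it".
    print(AND(left_of(car, tree), closer(car, tree)))
```

In this style, a multi-hop SpatialPuzzles query becomes a conjunction of attribute scores (from the VLM) and geometric predicate scores (from the scene proxy), maximized over candidate object assignments.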
| Method | Depth | Distance | SpatialPuzzles |
|---|---|---|---|
| Qwen3VL-8B | 85.5 | 74.3 | 51.9 |
| SpatialReasoner | 87.3 | 73.3 | 47.4 |
| **SATURN** | **90.5** | **84.0** | **84.6** |

*Table: Spatial reasoning accuracy (%) with the same backbone.*

**Conclusion.** SATURN demonstrates that equipping VLMs with an explicit geometric-symbolic layer substantially improves spatial reasoning on both standard and compositional benchmarks, without any retraining. Because the symbolic engine is modular and backbone-agnostic, it can be attached to any VLM to extend its reasoning capabilities to higher-complexity, multi-hop spatial queries. This points toward interpretable spatial AI in which geometric computation, not implicit weights, drives spatial understanding.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 176