When Spatial Reasoning Goes in Circles: Measuring Ordinal Consistency in Multimodal LLMs via Tournament Theory

Kaustubh S. Bukkapatnam; Rayan Malik; Atharv Kanchi

Published: 09 May 2026, Last Modified: 09 May 2026, Venue: MUSI, Readers: Everyone, License: CC BY 4.0

Keywords: multimodal LLMs, spatial reasoning, ordinal consistency, tournament graphs

TL;DR: We introduce CTR and OSC, tournament-graph metrics that reveal transitivity failures in multimodal LLM spatial reasoning. Experiments on synthetic 3D scenes show CTR predicts benchmark performance.

Abstract: Multimodal large language models (MLLMs) answer pairwise spatial queries (``Is object $A$ to the left of $B$?'') with increasing fluency, yet we show they routinely produce transitively inconsistent responses: simultaneously asserting $A \prec B$, $B \prec C$, and $C \prec A$ for the same axis and scene. We formalize this failure mode using tournament graph theory, introducing the \textbf{Cyclic Triple Rate} (CTR) and \textbf{Ordinal Spatial Consistency} (OSC) as model-level metrics. We prove six theorems, including the following: a random model achieves CTR\,$= 1/4$ exactly; computing optimal OSC is NP-hard via reduction to Minimum Feedback Arc Set; a score-ranking heuristic gives a $1/2$-approximation in $O(N\log N)$; and a Gaussian noise model yields a closed-form prediction $P(\text{cycle}) = \alpha\beta(1{-}\gamma) + (1{-}\alpha)(1{-}\beta)\gamma$. Querying five state-of-the-art MLLMs on rendered synthetic 3D scenes, we find that CTR at $N=10$ ranges from 3.8\% (GPT-4o) to 18.4\% (LLaVA-1.6-34B) on the depth axis, up to $6\times$ higher than on the horizontal axis, and that CTR predicts performance on four established spatial benchmarks with Spearman $\rho \leq -0.97$. The theoretical cycle formula fits the observed data with a maximum residual of 2.9pp. A depth uncertainty parameter $\hat\sigma$ recovered from CTR observations alone matches direct estimates with $<5\%$ error. Augmenting supervised fine-tuning with our differentiable $\mathcal{L}_{\mathrm{cycle}}$ reduces CTR by up to 28\% in early training while improving pairwise accuracy.
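To make the cyclic-triple idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes CTR is the fraction of object triples whose pairwise answers form a directed 3-cycle, and it assumes $\alpha$, $\beta$, $\gamma$ denote the independent probabilities that the model asserts $A \prec B$, $B \prec C$, and $A \prec C$. The names `cyclic_triple_rate`, `random_tournament`, and `cycle_probability` are hypothetical and chosen only for illustration.

```python
from itertools import combinations
import random


def cyclic_triple_rate(beats):
    """Fraction of triples forming a directed 3-cycle (assumed CTR definition).

    beats[i][j] is True when the model asserts that object i precedes
    object j on the chosen axis. Requires at least three objects.
    """
    n = len(beats)
    total = 0
    cyclic = 0
    for a, b, c in combinations(range(n), 3):
        total += 1
        forward = beats[a][b] and beats[b][c] and beats[c][a]   # a -> b -> c -> a
        backward = beats[b][a] and beats[c][b] and beats[a][c]  # a -> c -> b -> a
        if forward or backward:
            cyclic += 1
    return cyclic / total


def random_tournament(n, rng):
    """Orient every pair by a fair coin flip (the 'random model' baseline)."""
    beats = [[False] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        if rng.random() < 0.5:
            beats[i][j] = True
        else:
            beats[j][i] = True
    return beats


def cycle_probability(alpha, beta, gamma):
    """Closed form from the abstract, under the assumed meaning of the symbols."""
    return alpha * beta * (1 - gamma) + (1 - alpha) * (1 - beta) * gamma


rng = random.Random(0)
trials = 2000
mean_ctr = sum(
    cyclic_triple_rate(random_tournament(10, rng)) for _ in range(trials)
) / trials

# A perfectly consistent (transitive) ordering of 10 objects has no cycles.
transitive = [[i < j for j in range(10)] for i in range(10)]

print(round(mean_ctr, 3))                 # close to 0.25 for the random model
print(cyclic_triple_rate(transitive))     # no cyclic triples
print(cycle_probability(0.5, 0.5, 0.5))   # coin-flip case of the closed form
```

The coin-flip case is a useful sanity check: each triple has eight possible orientations, two of which are cyclic, so both the simulation and the closed form at $\alpha = \beta = \gamma = 1/2$ should land at $1/4$.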

Supplementary Material: pdf

Previously Accepted: No

Email Sharing: We authorize the sharing of all author emails with Program Chairs.

Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.

Submission Number: 24
