Sphinx: Visual Perception and Reasoning Gym

20 Sept 2025 (modified: 12 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Multimodal reasoning, vLLM, Synthetic datasets
Abstract: We present Sphinx, a synthetic gym for visual perception and reasoning tasks that targets core cognitive primitives. Sphinx procedurally generates problems from motifs, tiles, charts, icons, and geometric primitives, each paired with a verifiable ground-truth solution. This design enables both precise evaluation and the creation of scalable training datasets. We implement 25 task types spanning symmetry detection, geometric transformation, spatial reasoning, chart interpretation, and sequence prediction. Benchmarking recent vision–language models (vLLMs) reveals that even state-of-the-art GPT-5 struggles on these tasks, achieving 47.32% accuracy, significantly below human baselines. Finally, we demonstrate that reinforcement learning with verifiable rewards (RLVR) improves model accuracy on these reasoning tasks, underscoring its potential for advancing multimodal reasoning.
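To make the generate-and-verify loop concrete, the sketch below shows what one such primitive might look like: a vertical-symmetry detection task whose ground truth can be checked exactly, yielding the binary reward signal that RLVR training consumes. This is a minimal illustration under stated assumptions, not Sphinx's actual API: the function names (`make_symmetry_task`, `reward`) are hypothetical, and the binary grid is a stand-in for the rendered motif/tile images the paper describes.

```python
import random

import numpy as np


def make_symmetry_task(size=8, seed=None):
    """Generate one vertical-symmetry detection problem.

    The "image" is a binary grid standing in for a rendered stimulus;
    the ground-truth answer is exactly verifiable by construction.
    """
    rng = random.Random(seed)
    left = np.array(
        [[rng.randint(0, 1) for _ in range(size // 2)] for _ in range(size)],
        dtype=np.uint8,
    )
    symmetric = rng.random() < 0.5
    right = np.fliplr(left).copy()  # mirror image of the left half
    if not symmetric:
        # Break the symmetry by toggling a single random cell.
        r, c = rng.randrange(size), rng.randrange(size // 2)
        right[r, c] ^= 1
    return {
        "image": np.concatenate([left, right], axis=1),
        "question": "Is the pattern mirror-symmetric about its vertical axis? Answer yes or no.",
        "answer": "yes" if symmetric else "no",
    }


def reward(task, model_answer):
    """Binary verifiable reward of the kind RLVR-style training consumes."""
    return 1.0 if model_answer.strip().lower() == task["answer"] else 0.0


task = make_symmetry_task(seed=42)
print(task["image"], task["answer"], reward(task, "Yes"), sep="\n")
```

Because every instance is constructed together with its answer, the same generator serves both purposes the abstract names: precise evaluation at benchmark time and an unbounded stream of reward-checkable training examples.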
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 25162