Sphinx: Visual Perception and Reasoning Gym

20 Sept 2025 (modified: 12 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Multimodal reasoning, vLLM, Synthetic datasets
Abstract: We present Sphinx, a synthetic gym for visual perception and reasoning tasks that targets core cognitive primitives. Sphinx procedurally generates problems from motifs, tiles, charts, icons, and geometric primitives, each paired with a verifiable ground-truth solution. This design enables both precise evaluation and the creation of scalable training datasets. We implement 25 task types spanning symmetry detection, geometric transformation, spatial reasoning, chart interpretation, and sequence prediction. Benchmarking recent vision–language models (vLLMs) reveals that even state-of-the-art GPT-5 struggles on these tasks, achieving 47.32% accuracy, significantly below human baselines. Finally, we demonstrate that reinforcement learning with verifiable rewards (RLVR) improves model accuracy on these reasoning tasks, underscoring its potential for advancing multimodal reasoning.
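To make the generate-and-verify loop concrete, the sketch below shows what one such primitive might look like: a vertical-symmetry detection task whose ground truth can be checked exactly, yielding the binary reward signal that RLVR training consumes. This is a minimal illustration under stated assumptions, not Sphinx's actual API: the function names (`make_symmetry_task`, `reward`) are hypothetical, and the binary grid is a stand-in for the rendered motif/tile images the paper describes.

```python
import random

import numpy as np


def make_symmetry_task(size=8, seed=None):
    """Generate one vertical-symmetry detection problem.

    The "image" is a binary grid standing in for a rendered stimulus;
    the ground-truth answer is exactly verifiable by construction.
    """
    rng = random.Random(seed)
    left = np.array(
        [[rng.randint(0, 1) for _ in range(size // 2)] for _ in range(size)],
        dtype=np.uint8,
    )
    symmetric = rng.random() < 0.5
    right = np.fliplr(left).copy()  # mirror image of the left half
    if not symmetric:
        # Break the symmetry by toggling a single random cell.
        r, c = rng.randrange(size), rng.randrange(size // 2)
        right[r, c] ^= 1
    return {
        "image": np.concatenate([left, right], axis=1),
        "question": "Is the pattern mirror-symmetric about its vertical axis? Answer yes or no.",
        "answer": "yes" if symmetric else "no",
    }


def reward(task, model_answer):
    """Binary verifiable reward of the kind RLVR-style training consumes."""
    return 1.0 if model_answer.strip().lower() == task["answer"] else 0.0


task = make_symmetry_task(seed=42)
print(task["image"], task["answer"], reward(task, "Yes"), sep="\n")
```

Because every instance is constructed together with its answer, the same generator serves both purposes the abstract names: precise evaluation at benchmark time and an unbounded stream of reward-checkable training examples.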
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 25162