Track: Regular papers (within 8 pages excluding appendix)
Keywords: bilingual benchmark, 3D visual reasoning, vision–language models, compositional reasoning
TL;DR: PersianCLEVR is a bilingual (English–Persian) 3D visual reasoning benchmark built by unifying CLEVR, Super-CLEVR, and ClevrTex, then synthesizing the missing QA pairs for ClevrTex with an instruction-tuned vision LLM.
Abstract: Vision–language models (VLMs) have made rapid progress on 2D visual reasoning, yet robust three-dimensional (3D) understanding and multilingual generalisation (particularly in Persian) remain underexplored. To address this gap, we introduce PersianCLEVR, a bilingual (English–Persian) benchmark targeting 3D scene understanding across five reasoning skills: attribute identification, counting, comparison, spatial relationships, and logical operations. PersianCLEVR is constructed by unifying CLEVR, Super-CLEVR, and ClevrTex; we synthesize the missing question–answer pairs for ClevrTex with an instruction-tuned vision LLM, categorise items using an automated pipeline, and then translate and balance the full set to yield parallel English–Persian splits. We outline evaluation protocols that test instructed VLMs in zero-shot and in-context learning settings, and include standard text metrics (BLEU, METEOR, ROUGE) for assessing translation quality alongside task accuracy. Together, these components provide a controlled, multilingual testbed for diagnosing compositional and spatial reasoning in 3D. We present baseline experiments and analyses that chart current strengths and failure modes, and that aim to spur research on 3D-aware, multilingual VLMs.
Confirmation: I have read and agree with the workshop's policy on behalf of myself and my co-authors.
Submission Number: 10