SchemixQA and CoRe-VLM: A Benchmark and Collaborative Refinement (CoRe) Framework for Visual Question Answering on Technical Schematics
Keywords: SchemixQA, CoRe-VLM Framework, VLM, LLM, Actor-Critic Refinements
Abstract: We present SchemixQA, a multimodal benchmark for evaluating Vision–Language Models (VLMs) on Visual Question Answering (VQA) over technical schematics. Unlike previous VQA datasets focused on natural images, SchemixQA targets structured domains such as circuits, flowcharts, logic gates, P&I diagrams, and state diagrams, each paired with natural language questions and multiple reference answers. To address this setting, we introduce CoRe-VLM (Collaborative Refinement for VLMs), the first actor–critic-inspired refinement framework for schematic VQA. In CoRe-VLM, an actor VLM generates answers, while a critic VLM verifies them and provides corrective feedback. A fallback mechanism ensures robustness by reverting to the actor’s output when the critic introduces errors. We benchmark seven state-of-the-art VLMs, including GPT-4o, Gemini, Qwen2, and LLaVA, under single-pass and CoRe-VLM inference. Results show that CoRe-VLM consistently improves lexical (Exact Match, BLEU, ROUGE-L) and semantic (BERTScore, Macro/Micro-F1) metrics, with especially strong gains for weaker open-source actors when paired with a strong critic. Together, SchemixQA and CoRe-VLM establish a new foundation for domain-specific multimodal reasoning.
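The abstract's actor–critic loop with fallback can be sketched as follows. This is a minimal illustration of the described control flow, not the paper's implementation; the function names, the revision step, and the `score` heuristic used to detect critic-introduced errors are all assumptions.

```python
# Hypothetical sketch of the CoRe-VLM loop: actor proposes, critic verifies,
# and a fallback keeps the actor's draft if the revision scores worse.
# `actor`, `critic`, and `score` are assumed callables, not the paper's API.
def core_vlm_answer(actor, critic, image, question, score):
    draft = actor(image, question)             # actor VLM proposes an answer
    feedback = critic(image, question, draft)  # critic verifies the draft
    if not feedback:                           # critic accepts: no revision
        return draft
    revised = actor(image, question, feedback) # actor revises using feedback
    # Fallback mechanism: revert to the draft if the critic's feedback
    # degraded the answer under the scoring heuristic.
    return revised if score(revised) >= score(draft) else draft
```

A stronger critic mainly helps weaker actors here because the revision step only replaces the draft when it does not regress under the fallback check.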
Primary Area: datasets and benchmarks
Submission Number: 13662