SchemixQA and CoRe-VLM: A Benchmark and Collaborative Refinement (CoRe) Framework for Visual Question Answering on Technical Schematics
Keywords: SchemixQA, CoRe-VLM Framework, VLM, LLM, Actor-Critic Refinements
Abstract: We present SchemixQA, a multimodal benchmark for evaluating Vision–Language Models (VLMs) on Visual Question Answering (VQA) over technical schematics. Unlike previous VQA datasets focused on natural images, SchemixQA targets structured domains such as circuits, flowcharts, logic gates, P&I diagrams, and state diagrams, each paired with natural language questions and multiple reference answers. To address this setting, we introduce CoRe-VLM (Collaborative Refinement for VLMs), the first actor–critic-inspired refinement framework for schematic VQA. In CoRe-VLM, an actor VLM generates answers, while a critic VLM verifies them and provides corrective feedback. A fallback mechanism ensures robustness by reverting to the actor’s output when the critic introduces errors. We benchmark seven state-of-the-art VLMs, including GPT-4o, Gemini, Qwen2, and LLaVA, under single-pass and CoRe-VLM inference. Results show that CoRe-VLM consistently improves lexical (Exact Match, BLEU, ROUGE-L) and semantic (BERTScore, Macro/Micro-F1) metrics, with especially strong gains for weaker open-source actors when paired with a strong critic. Together, SchemixQA and CoRe-VLM establish a new foundation for domain-specific multimodal reasoning.
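The abstract's actor–critic loop with fallback can be sketched as follows. This is a minimal illustration of the described control flow, not the paper's implementation; the function names, the revision step, and the `score` heuristic used to detect critic-introduced errors are all assumptions.

```python
# Hypothetical sketch of the CoRe-VLM loop: actor proposes, critic verifies,
# and a fallback keeps the actor's draft if the revision scores worse.
# `actor`, `critic`, and `score` are assumed callables, not the paper's API.
def core_vlm_answer(actor, critic, image, question, score):
    draft = actor(image, question)             # actor VLM proposes an answer
    feedback = critic(image, question, draft)  # critic verifies the draft
    if not feedback:                           # critic accepts: no revision
        return draft
    revised = actor(image, question, feedback) # actor revises using feedback
    # Fallback mechanism: revert to the draft if the critic's feedback
    # degraded the answer under the scoring heuristic.
    return revised if score(revised) >= score(draft) else draft
```

A stronger critic mainly helps weaker actors here because the revision step only replaces the draft when it does not regress under the fallback check.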
Primary Area: datasets and benchmarks
Submission Number: 13662