SciGram: A Large-Scale Dataset for Scientific Diagram Understanding

18 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: multimodal large language models, diagram understanding, visual question answering, instruction tuning, dataset generation, synthetic data, scientific diagrams, vision-language alignment
TL;DR: We introduce SciGram, a dataset of 200K web-crawled scientific diagrams paired with synthetic instruction data; fine-tuning LLaVA on SciGram achieves state-of-the-art results on diagram questions from TQA, ScienceQA, and AI2D.
Abstract: Multimodal large language models (MLLMs) have achieved strong results in visual question answering with natural images, yet their performance on diagram-based reasoning remains limited, largely due to the scarcity of high-quality training data. We present SciGram, a dataset of 200,000 scientific diagrams paired with synthetic instruction-following data grounded in middle-school science terminology. SciGram is built through a cost-efficient pipeline for generating multimodal alignment and instruction data tailored to vision-language understanding with scientific diagrams. Fine-tuning LLaVA on SciGram (LLaVA-SciGram) consistently improves diagram-based question answering on TQA, ScienceQA, and AI2D, surpassing LLaVA-OneVision while using substantially fewer instruction-tuning examples. Furthermore, incorporating SciGram as an additional instruction-tuning stage for LLaVA-OneVision establishes new state-of-the-art results across all three benchmarks, underscoring the robustness and effectiveness of our dataset. To foster progress in diagram understanding, we release both the SciGram dataset and the LLaVA-SciGram model.
Primary Area: datasets and benchmarks
Submission Number: 13059