SciVSG: Unified Visual-Semantic Graph with Traceable Evidence for Scientific Diagram Understanding

ACL ARR 2026 January Submission9316 Authors

06 Jan 2026 (modified: 20 Mar 2026) | ACL ARR 2026 January Submission | Readers: Everyone | License: CC BY 4.0
Keywords: Scientific Diagram Understanding; Visual-Semantic Representation; Visual Layout Hierarchy; Semantic Scene Graph; Knowledge Extraction
Abstract: Scientific diagrams capture complex mechanisms, experimental evidence, and structural relationships, yet remain challenging to interpret and reason over due to their heterogeneous layouts and lack of traceable representations. Scientific Visual-Semantic Graph (SciVSG) addresses this challenge by providing a unified visual-semantic representation that integrates layout and semantic information. It first constructs a Visual Layout Hierarchy (VLH) from layout cues and reading conventions, establishing a structured foundation for diagram understanding. Node-level verifiable evidence, including localized OCR and aligned paper snippets, grounds predictions to explicit text spans and regions. On this basis, a Semantic Scene Graph (SSG) is built by linking typed entities and normalized relations to nodes under strict evidence constraints, enabling module-aware reasoning and fine-grained traceability. A benchmark of diverse scientific diagrams is also provided, annotated with VLH, node evidence, entities, relations, and expert-authored QA pairs across multiple categories. Experiments demonstrate that SciVSG substantially enhances knowledge extraction and produces more reliable, evidence-attributable answers for diagram-based question answering.
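The abstract describes a layout hierarchy whose nodes carry evidence, and a semantic graph whose relations are admitted only when that evidence supports them. A minimal Python sketch of that idea follows. Every name in it (Evidence, VLHNode, Relation, SciVSGSketch, add_relation) is hypothetical, and the substring check is a stand-in assumption for the "strict evidence constraints"; the abstract does not specify the paper's actual data structures or matching procedure.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Evidence:
    """One piece of node-level evidence (hypothetical schema)."""
    kind: str                                   # "ocr" or "paper_snippet"
    text: str                                   # the explicit text span
    bbox: Optional[Tuple[int, int, int, int]] = None  # region, for OCR evidence


@dataclass
class VLHNode:
    """A node in the Visual Layout Hierarchy (hypothetical schema)."""
    node_id: str
    bbox: Tuple[int, int, int, int]
    parent: Optional[str] = None                # parent node id in the hierarchy
    evidence: List[Evidence] = field(default_factory=list)


@dataclass
class Relation:
    """A typed relation linked to the node that grounds it."""
    head: str
    predicate: str
    tail: str
    node_id: str
    evidence_text: str                          # span that must appear in node evidence


class SciVSGSketch:
    """Toy container: layout nodes plus evidence-constrained relations."""

    def __init__(self) -> None:
        self.nodes: Dict[str, VLHNode] = {}
        self.relations: List[Relation] = []

    def add_node(self, node: VLHNode) -> None:
        self.nodes[node.node_id] = node

    def add_relation(self, rel: Relation) -> bool:
        # Assumption: a relation is kept only if its cited span is non-empty
        # and occurs verbatim in the evidence attached to its node.
        node = self.nodes.get(rel.node_id)
        if node is None or not rel.evidence_text:
            return False
        supported = any(rel.evidence_text in ev.text for ev in node.evidence)
        if supported:
            self.relations.append(rel)
        return supported

    def trace(self, rel: Relation) -> List[Evidence]:
        # Return the evidence items that contain the relation's cited span.
        node = self.nodes[rel.node_id]
        return [ev for ev in node.evidence if rel.evidence_text in ev.text]


# Usage with made-up diagram content.
graph = SciVSGSketch()
graph.add_node(VLHNode("root", (0, 0, 800, 600)))
graph.add_node(VLHNode(
    "panel_a", (0, 0, 400, 300), parent="root",
    evidence=[Evidence("ocr", "Encoder feeds Decoder", (10, 10, 200, 40))],
))

accepted = graph.add_relation(
    Relation("Encoder", "feeds", "Decoder", "panel_a", "Encoder feeds Decoder"))
rejected = graph.add_relation(
    Relation("Decoder", "feeds", "Encoder", "panel_a", "Decoder feeds Encoder"))
traced = graph.trace(graph.relations[0])
```

In this sketch the second relation is dropped because no evidence on its node contains the cited span, which is one simple way to keep every stored relation traceable to a text span and region.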
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: cross-modal content generation; multimodality
Contribution Types: Model analysis & interpretability, Data resources, Theory
Languages Studied: English
Submission Number: 9316