Abstract: Existing Large Multimodal Models (LMMs) demonstrate excellent performance on visual tasks in everyday scenarios. However, they still struggle to understand structured images, such as flowcharts and organizational charts, which are characterized by text-rich and complex hierarchical components. In this paper, we propose SiQA, a knowledge-construction and Retrieval-Augmented Generation (RAG)-based multimodal question-answering model designed for Structured Images. SiQA operates in three stages: Knowledge Graph (KG) generation, retrieval augmentation, and answer generation. First, a KG representing the semantics of the structured image is generated through component analysis. We then perform similarity retrieval between the KG and the query, using a node-first algorithm to construct the most relevant subgraph. Finally, the multimodal information is aligned through encoding and fed into the LLM to generate the answer. Additionally, we introduce a new dataset, OCQA, which includes 5,112 questions derived from 1,000 organizational charts. We evaluate SiQA's structured-image detection and question-answering capabilities on FD-DETR (a flowchart dataset) and SCQA, and verify its effectiveness and strong generalization ability through comparisons with existing state-of-the-art (SOTA) methods.
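To make the retrieval stage concrete, below is a minimal sketch of what a node-first subgraph construction could look like: seed nodes are ranked by embedding similarity to the query, then expanded along KG edges to form the retrieved subgraph. The function name `node_first_subgraph`, the `top_k` and `hops` parameters, cosine similarity as the scoring function, and the one-hop neighbor expansion are all illustrative assumptions, not the paper's actual specification.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def node_first_subgraph(query_emb, node_embs, edges, top_k=5, hops=1):
    """Hypothetical node-first retrieval: pick the nodes most similar to
    the query, then expand along KG edges to keep relational context.

    query_emb: np.ndarray embedding of the query
    node_embs: dict mapping node id -> np.ndarray embedding
    edges:     list of (u, v) node-id pairs in the KG
    """
    # Node-first step: rank all KG nodes by similarity to the query.
    scores = {n: cosine_sim(query_emb, e) for n, e in node_embs.items()}
    seeds = sorted(scores, key=scores.get, reverse=True)[:top_k]

    # Expand the seed set to neighboring nodes for `hops` rounds.
    selected = set(seeds)
    frontier = set(seeds)
    for _ in range(hops):
        frontier = ({v for u, v in edges if u in frontier}
                    | {u for u, v in edges if v in frontier}) - selected
        selected |= frontier

    # Return the induced subgraph: selected nodes and edges among them.
    sub_edges = [(u, v) for u, v in edges if u in selected and v in selected]
    return selected, sub_edges
```

The returned subgraph would then be serialized (e.g., as node/edge text) and combined with the image features before being fed to the LLM for answer generation, per the pipeline described above.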