Hierarchy-Based Diagram-Sentence Matching on Dual-Modal Graphs

Published: 01 Jan 2025, Last Modified: 22 Jul 2025. Pattern Recognit. 2025. License: CC BY-SA 4.0.
Abstract: A diagram is a special kind of image drawn by domain experts, consisting mainly of graphic symbols and abstract drawings. Combining diagrams with other modalities (e.g., textual descriptions and subtitles in teaching videos) is essential for an in-depth understanding of the knowledge concepts they convey. Diagram-sentence matching, a novel task proposed to bridge abstract diagram representations and explicit natural language, is significant for textbook question answering (TQA) and diagram understanding, but remains challenging. Existing vision-language matching works focus mainly on natural images and are not applicable to diagrams, for two reasons: (1) relations in diagrams take diverse representational forms; (2) the knowledge concepts conveyed in diagrams are key to fine-grained diagram-sentence matching. In this paper, we propose the Hierarchy-Based Diagram-Sentence Matching (HBDSM) model and transform this problem into a cross-modal knowledge concept matching task at multiple levels. To achieve this, HBDSM first encodes the diagram and the sentence as symmetrical dual-modal graphs. For the diagram, a novel Visual Relation Structure Learning (VRSL) method is designed to explore the structural relations between objects, which constitute the edges. For the sentence, words are fused into object and relation chunks as nodes, associated by edges according to their semantic dependencies. Motivated by the human cognitive process, the fine-grained correspondence between diagram and sentence is modeled progressively over the hierarchy of the dual-modal graphs, moving from low-order to high-order information. Node-level matching establishes alignment of object nodes; building on this, structure-level matching compares the internal structures of the two graphs. Finally, concept-level matching incorporates relation semantics to match cross-modal concepts based on the structure alignment.
Extensive experiments demonstrate the effectiveness of HBDSM in diagram-sentence matching, achieving new state-of-the-art results with a relative improvement of 20.0% in rSum on AI2D#. Competitive image-sentence matching performance on Flickr30K and MSCOCO further indicates that HBDSM also generalizes to natural images.
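The three matching levels described above can be viewed as a relaxed graph-matching computation over the dual-modal graphs. The sketch below illustrates that reading only; the function names, the bilinear structure score, and the fusion weights are illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

def node_level_match(V, T):
    """Node-level matching: cosine similarity between diagram object
    embeddings V (m x d) and sentence object-chunk embeddings T (n x d).
    Returns an (m, n) soft alignment matrix."""
    V = V / np.linalg.norm(V, axis=1, keepdims=True)
    T = T / np.linalg.norm(T, axis=1, keepdims=True)
    return V @ T.T

def structure_level_match(S, A_v, A_t):
    """Structure-level matching: given the node alignment S and the
    adjacency matrices A_v (diagram graph) and A_t (sentence graph),
    score how consistently aligned node pairs are connected in both
    graphs -- a standard relaxed graph-matching objective."""
    return float(np.sum(S * (A_v @ S @ A_t.T)))

def concept_level_match(node_score, struct_score, rel_score,
                        weights=(1.0, 0.5, 0.5)):
    """Concept-level fusion: combine node, structure, and relation-
    semantics scores into one diagram-sentence score. The linear
    combination here is a placeholder, not the model's actual head."""
    w1, w2, w3 = weights
    return w1 * node_score + w2 * struct_score + w3 * rel_score

# Toy example: two perfectly aligned 2-node graphs sharing one edge.
V = np.eye(2)                      # diagram node embeddings
T = np.eye(2)                      # sentence node embeddings
A = np.array([[0., 1.], [1., 0.]]) # shared adjacency structure
S = node_level_match(V, T)         # identity alignment
score = concept_level_match(S.trace(), structure_level_match(S, A, A), 0.0)
```

In this toy case the alignment matrix is the identity and both edges agree across graphs, so the structure term rewards the matched pair of edges; in the real model each level would instead be computed from learned representations.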