Abstract: Diagram question answering (DQA), which is defined as answering natural language questions according to the visual diagram context, has attracted increasing attention and has recently become a new benchmark for evaluating the complex reasoning ability of models. However, this reasoning task is extremely challenging because of the inclusion of abstract visual objects and specialized textual terms, as well as the complex relationships between them. The scarcity of data caused by the high cost of annotation also renders large-scale deep models ineffective for the DQA task. To address the above challenges, this paper proposes the cross-modal alignment-guided self-supervised learning model for DQA (CAS-DQA). Unlike previous works, the CAS-DQA model focuses on learning internal visual-textual object relationships, introduces an attention module based on object alignment, and effectively integrates cross-modal knowledge units for diagram understanding. In addition, the CAS-DQA model constructs two self-supervised learning (SSL) tasks via intermediate results of visual-textual object alignment. These two tasks exploit otherwise unnoticed objects inside the diagram to achieve a more complete understanding of it. They also effectively increase the amount of diagram question-answering data, addressing the challenge of data scarcity. To the best of our knowledge, the CAS-DQA model is the first to extend SSL strategies to the diagram question-answering task. We evaluate the CAS-DQA model on three different datasets. The results of extensive experiments show that our model significantly outperforms baselines in different scenarios and that the internal object alignment module and the self-supervised tasks are both effective.
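The abstract describes an attention module that aligns visual objects with textual objects before fusing them for diagram understanding. As an illustration only, the following minimal PyTorch sketch shows one common way such alignment-guided cross-modal attention can be realized; the class name, feature dimensions, and fusion strategy are assumptions, not the paper's actual implementation.

```python
# Minimal sketch (not the authors' implementation) of alignment-guided
# cross-modal attention: visual object features attend over textual object
# features, and the aligned text context is fused back into each visual object.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalAlignmentAttention(nn.Module):
    def __init__(self, d_visual: int, d_text: int, d_model: int = 256):
        super().__init__()
        self.q_proj = nn.Linear(d_visual, d_model)  # queries from visual objects
        self.k_proj = nn.Linear(d_text, d_model)    # keys from textual objects
        self.v_proj = nn.Linear(d_text, d_model)    # values from textual objects
        self.out = nn.Linear(d_visual + d_model, d_model)

    def forward(self, visual_feats, text_feats):
        # visual_feats: (B, Nv, d_visual); text_feats: (B, Nt, d_text)
        q = self.q_proj(visual_feats)                          # (B, Nv, d_model)
        k = self.k_proj(text_feats)                            # (B, Nt, d_model)
        v = self.v_proj(text_feats)                            # (B, Nt, d_model)
        scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5   # (B, Nv, Nt)
        align = F.softmax(scores, dim=-1)                      # soft visual-to-text alignment
        aligned_text = align @ v                               # (B, Nv, d_model)
        fused = self.out(torch.cat([visual_feats, aligned_text], dim=-1))
        # The alignment matrix could, in principle, drive SSL objectives such as
        # alignment prediction, as hinted at in the abstract.
        return fused, align

if __name__ == "__main__":
    # Example with random features for 10 visual and 12 textual objects.
    module = CrossModalAlignmentAttention(d_visual=512, d_text=300)
    vis = torch.randn(2, 10, 512)
    txt = torch.randn(2, 12, 300)
    fused, align = module(vis, txt)
    print(fused.shape, align.shape)  # (2, 10, 256), (2, 10, 12)
```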