Abstract: Diagram question answering (DQA) is an effective way to evaluate the reasoning ability for diagram semantic understanding, which is a very challenging task and largely understudied compared with natural images. Existing separate two-stage methods for DQA are limited in ineffective feedback mechanisms. To address this problem, in this paper, we propose a novel structural parsing-integrated Hierarchical Multi-Task Learning (HMTL) model for diagram question answering based on a multi-modal transformer framework. In the proposed paradigm of multi-task learning, the two tasks of diagram structural parsing and question answering are in the different semantic levels and equipped with different transformer blocks, which constituents a hierarchical architecture. The structural parsing module encodes the information of constituents and their relationships in diagrams, while the diagram question answering module decodes the structural signals and combines question-answers to infer correct answers. Visual diagrams and textual question-answers are interplayed in the multi-modal transformer, which achieves cross-modal semantic comprehension and reasoning. Extensive experiments on the benchmark AI2D and FOODWEBS datasets demonstrate the effectiveness of our proposed HMTL over other state-of-the-art methods.
0 Replies
Loading