Diagram Perception Networks for Textbook Question Answering via Joint Optimization

Published: 01 Jan 2024 · Last Modified: 14 Nov 2024 · Int. J. Comput. Vis. 2024 · CC BY-SA 4.0
Abstract: Textbook question answering requires a system to accurately answer questions, with or without diagrams, given multimodal contexts that include rich paragraphs and diagrams. Existing methods usually adopt a pipeline that extracts the most relevant paragraph from the multimodal context, and employ only convolutional neural networks to comprehend diagram semantics under the supervision of answer labels. The former causes error accumulation, while the latter leads to poor diagram understanding. To remedy these issues, we propose an end-to-end DIagraM Perception network for textbook question answering (DIMP), jointly optimized under the supervision of relation prediction, diagram classification, and question answering. Specifically, knowledge extraction is cast as a sequence classification task and optimized under the supervision of answer labels to alleviate error accumulation. To capture diagram semantics effectively, DIMP uses an explicit relation-aware method that first parses a diagram into several graphs under specific relations and then models information propagation within them. Evaluation on two benchmark datasets shows that our method achieves results competitive with or better than current state-of-the-art methods, without large-scale pre-training or constructed auxiliary tasks. We provide comprehensive ablation studies and thorough analyses to determine which factors contribute to this success, along with in-depth analyses of relational graph learning and joint optimization.
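The joint optimization described above can be illustrated with a minimal sketch. The abstract names three supervision signals (relation prediction, diagram classification, question answering) but does not specify how they are combined; the weighted sum of cross-entropy terms below, along with the weight names and head shapes, is an assumption for illustration only, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class JointLoss(nn.Module):
    """Hypothetical joint objective combining the three supervision
    signals named in the abstract. The per-task weights (w_rel, w_cls,
    w_qa) and the use of plain cross-entropy are assumptions."""

    def __init__(self, w_rel=1.0, w_cls=1.0, w_qa=1.0):
        super().__init__()
        self.w_rel, self.w_cls, self.w_qa = w_rel, w_cls, w_qa
        self.ce = nn.CrossEntropyLoss()

    def forward(self, rel_logits, rel_labels,
                cls_logits, cls_labels,
                qa_logits, qa_labels):
        # Sum the three supervised objectives; gradients from all tasks
        # flow back into the shared encoder during end-to-end training.
        return (self.w_rel * self.ce(rel_logits, rel_labels)
                + self.w_cls * self.ce(cls_logits, cls_labels)
                + self.w_qa * self.ce(qa_logits, qa_labels))
```

In use, one forward pass through the shared network would produce all three sets of logits, and a single backward pass on this combined loss would update every component, avoiding the error accumulation of a pipelined design.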