MVARN: Multi-view Attention Relation Network for Figure Question Answering

Yingdong Wang, Qingfeng Wu, Weiqiang Lin, Linjian Ma, Ying Li

Published: 01 Jan 2023, Last Modified: 17 Nov 2023, KSEM (3) 2023, Readers: Everyone

Abstract: Figure Question Answering (FQA) is an emerging multimodal task closely related to Visual Question Answering (VQA). FQA aims to answer questions about scientific-style charts. In this study, we propose a novel model, the Multi-view Attention Relation Network (MVARN), which uses key image features and multi-view relational reasoning to address this task. To enhance the expressiveness of the image output features, we introduce a Contextual Transformer (CoT) block that implements relational reasoning based on both pixel and channel views. Our experimental evaluation on the FigureQA and DVQA datasets demonstrates that MVARN outperforms other state-of-the-art techniques. Our approach also performs consistently across different question types, which supports its effectiveness and robustness.
