Question-Led Semantic Structure Enhanced Attentions for VQA

Anonymous

16 Nov 2021 (modified: 05 May 2023) · ACL ARR 2021 November Blind Submission · Readers: Everyone
Abstract: Exploiting semantic structure in visual question answering (VQA) is a trending topic: researchers are interested in leveraging internal semantics and bringing in external knowledge to tackle more complex questions. Prevailing approaches either encode the external knowledge separately from the local context, which greatly increases the complexity of the resulting ensemble system, or model the semantic structure of the context with graph neural networks, whose relatively shallow architectures limit reasoning capability. In this work, we propose a question-led structure extraction scheme that uses external knowledge, and we explore multiple training methods, including direct attention supervision, SGHMC-EM Bayesian multitask learning, and masking strategies, to aggregate the structural knowledge into deep models without changing their architectures. We conduct extensive experiments on two domain-specific but challenging sub-tasks of the VrR-VG dataset and demonstrate that our proposed methods achieve significant improvements over strong baselines, showing promising applicability.
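The abstract mentions direct attention supervision as one way to inject structural knowledge without changing the model architecture. A minimal sketch of that general idea, assuming a target attention distribution derived from an external structure extractor, is a KL-divergence auxiliary loss between the model's attention weights and the target; the function name and shapes here are illustrative, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def attention_supervision_loss(att_logits, target_att):
    """Auxiliary loss pushing model attention toward a structure-derived target.

    att_logits: (batch, n) raw attention scores from the model.
    target_att: (batch, n) target attention distribution (rows sum to 1),
                e.g. derived from an external semantic-structure extractor.
    Returns KL(target || predicted), averaged over the batch.
    """
    log_pred = F.log_softmax(att_logits, dim=-1)
    # F.kl_div expects log-probabilities as input and probabilities as target.
    return F.kl_div(log_pred, target_att, reduction="batchmean")
```

In practice such a term would be added to the main task loss with a weighting coefficient, so the base architecture is untouched and only the training objective changes.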