Ask Question with Double Hints: Visual Question Generation with Answer-awareness and Region-reference

Kai Shen; Lingfei Wu; Siliang Tang; Fangli Xu; Zhu Zhang; Yu Qiang; Yueting Zhuang

28 Sept 2020 (modified: 11 May 2023). ICLR 2021 Conference Blind Submission. Readers: Everyone

Keywords: Semi-supervised Learning, graph neural network, vision and language, question generation

Abstract: The task of visual question generation (VQG) aims to generate human-like questions from an image and potentially other side information (e.g., the answer type or the answer itself). Although promising results have been achieved, previous works on VQG either i) suffer from the one-image-to-many-questions mapping problem, which leads to a failure to generate referential and meaningful questions from an image, or ii) ignore the rich correlations among the visual objects in an image and the potential interactions between the side information and the image. To address these limitations, we first propose a novel learning paradigm to generate visual questions with answer-awareness and region-reference. In particular, we aim to ask the right visual questions with Double Hints (textual answers and visual regions of interest), effectively mitigating the existing one-to-many mapping issue. To this end, we develop a simple methodology to self-learn the visual hints without introducing any additional human annotations. Furthermore, to capture these sophisticated relationships, we propose a new double-hints guided Graph-to-Sequence learning framework that first models them as a dynamic graph and learns the implicit topology end-to-end, and then utilizes a graph-to-sequence model to generate the questions with double hints. Our experiments on the VQA2.0 and COCO-QA datasets demonstrate that our proposed model in this new setting outperforms existing state-of-the-art baselines by a large margin.

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics

One-sentence Summary: For the visual question generation task, we propose a new learning paradigm and a novel double-hints guided graph-to-sequence learning framework to address the one-to-many mapping problem and the problem of modeling objects with side information.

Reviewed Version (pdf): https://openreview.net/references/pdf?id=p59Rxyx-8V

