Scene Text Visual Question Answering

31 Jan 2020 · OpenReview Archive Direct Upload
Abstract: Current visual question answering datasets do not consider the rich semantic information conveyed by text within an image. In this work, we present a new dataset, ST-VQA, that aims to highlight the importance of exploiting the high-level semantic information present in images as textual cues in the VQA process. We use this dataset to define a series of tasks of increasing difficulty, in which reading the scene text in the context provided by the visual information is necessary to reason about and generate an appropriate answer. In addition, we propose a new evaluation metric for these tasks that accounts for both reasoning errors and shortcomings of the text recognition module.
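The proposed metric is not detailed in this abstract; the published ST-VQA benchmark evaluates with Average Normalized Levenshtein Similarity (ANLS), which awards partial credit when a predicted answer is close to the ground truth in edit distance, so minor OCR errors are penalized softly rather than scored as outright failures. The sketch below illustrates that idea; the threshold value and the lowercasing/whitespace normalization are assumptions here, not taken from this abstract.

```python
# Sketch of an ANLS-style soft-matching metric for scene-text VQA.
# Assumptions (not stated in the abstract): threshold tau = 0.5 and
# case/whitespace normalization before comparison.

def levenshtein(a: str, b: str) -> int:
    """Classic edit distance via dynamic programming (two-row variant)."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def anls(predictions, ground_truths, tau: float = 0.5) -> float:
    """Average over questions of the best per-answer similarity.

    For each question, a prediction scores 1 - NL(pred, gt) against its
    closest ground-truth answer, where NL is the edit distance normalized
    by the longer string's length. If NL >= tau the score is 0, treating
    the output as a recognition failure rather than a partial match.
    """
    total = 0.0
    for pred, gts in zip(predictions, ground_truths):
        best = 0.0
        for gt in gts:
            p, g = pred.strip().lower(), gt.strip().lower()
            nl = levenshtein(p, g) / max(len(p), len(g), 1)
            if nl < tau:
                best = max(best, 1.0 - nl)
        total += best
    return total / len(predictions)
```

A near-miss such as predicting "hell" for "hello" (normalized distance 0.2) earns 0.8 under this scheme, while an unrelated string earns 0, which is the behavior the abstract motivates: distinguishing reasoning errors from text-recognition noise.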
