Towards Video Text Visual Question Answering: Benchmark and Baseline

Minyi Zhao; Bingjia Li; Jie Wang; Wanqing Li; Wenjing Zhou; Lan Zhang; Shijie Xuyang; Zhihang Yu; Xinkun Yu; Guangze Li; Aobotao Dai; Shuigeng Zhou

Published: 17 Sept 2022, Last Modified: 23 May 2023. NeurIPS 2022 Datasets and Benchmarks. Readers: Everyone

Keywords: video text visual question answering, text-based visual question answering, video question answering

TL;DR: A new dataset, M4-ViteVQA, and a new method, T5-ViteVQA, for a new task: video text visual question answering.

Abstract: Several text-based visual question answering (TextVQA) benchmarks have been proposed in recent years to develop machines' ability to answer questions based on text in images. However, models developed on these benchmarks cannot work effectively in many real-life scenarios (e.g., traffic monitoring, shopping ads and e-learning videos) where temporal reasoning ability is required. To this end, we propose a new task named Video Text Visual Question Answering (ViteVQA for short) that aims at answering questions by reasoning spatiotemporally over the text and visual information in a given video. In particular, on the one hand, we build the first ViteVQA benchmark dataset, named M4-ViteVQA (short for Multi-category Multi-frame Multi-resolution Multi-modal benchmark for ViteVQA), which contains 7,620 video clips of 9 categories (i.e., shopping, traveling, driving, vlog, sport, advertisement, movie, game and talking) and 3 kinds of resolutions (i.e., 720p, 1080p and 1176x664), and 25,123 question-answer pairs. On the other hand, we develop a baseline method named T5-ViteVQA for the ViteVQA task. T5-ViteVQA consists of five transformers. It first extracts optical character recognition (OCR) tokens, question features, and video representations via two OCR transformers, one language transformer and one video-language transformer, respectively. Then, a multimodal fusion transformer and an answer generation module are applied to fuse the multimodal information and generate the final prediction. Extensive experiments on M4-ViteVQA demonstrate the superiority of T5-ViteVQA over existing approaches for the TextVQA and VQA tasks. The ViteVQA benchmark is available at https://github.com/bytedance/VTVQA.

Author Statement: Yes

URL: https://github.com/bytedance/VTVQA for annotations; https://drive.google.com/file/d/1XuPMW9hcWWjuTgjjQBb89j7WFn-tiCJU/view for videos, frames and features. Please note that when the reviewers download the videos, we assume that they have accepted the responsibility agreement.

Dataset Url: https://github.com/bytedance/VTVQA

License: The researcher shall use the M4-ViteVQA dataset only for non-commercial algorithm research and educational purposes. The researcher can not use the M4-ViteVQA dataset for any other purposes, including but not limited to distribution, commercial usage, etc... The researcher takes full responsibility for his or her use of the M4-ViteVQA dataset and shall defend and indemnify the dataset, including their affiliates, employees, trustees, officers and agents, against any and all claims arising from the researcher’s use of the M4-ViteVQA dataset. The researcher agrees and confirms that authors reserve the right to terminate the researcher’s access to the M4-ViteVQA dataset at any time. If the researcher is employed by a for-profit business entity, the researcher’s employer shall also be bound by these terms and conditions, and the researcher hereby shall represent that he or she is fully authorized to enter into this agreement on behalf of such employer.

Supplementary Material: pdf

Contribution Process Agreement: Yes

In Person Attendance: Yes
