A Bilingual, OpenWorld Video Text Dataset and End-to-end Video Text Spotter with Transformer

Weijia Wu; Debing Zhang; Yuanqiang Cai; Sibo Wang; Jiahong Li; Zhuang Li; Yejun Tang; Hong Zhou

A Bilingual, OpenWorld Video Text Dataset and End-to-end Video Text Spotter with Transformer

Weijia Wu, Debing Zhang, Yuanqiang Cai, Sibo Wang, Jiahong Li, Zhuang Li, Yejun Tang, Hong Zhou

Published: 11 Oct 2021, Last Modified: 04 May 2025NeurIPS 2021 Datasets and Benchmarks Track (Round 2)Readers: Everyone

Keywords: video text spotting, text detection and recognition

TL;DR: A Multilingual, OpenWorld Video Text Dataset and End-to-end Video Text Spotter with Transformer

Abstract: Most existing video text spotting benchmarks focus on evaluating a single language and scenario with limited data. In this work, we introduce a large-scale, Bilingual, Open World Video text benchmark dataset(BOVText). There are four features for BOVText. Firstly, we provide 1,850+ videos with more than 1,600,000+ frames, 25 times larger than the existing largest dataset with incidental text in videos. Secondly, our dataset covers 30+ open categories with a wide selection of various scenarios, Life Vlog, Driving, Movie, etc. Thirdly, abundant text types annotation (i.e., title, caption, or scene text) are provided for the different representational meanings in the video. Fourthly, the MOVText provides multilingual text annotation to promote multiple cultures' live and communication. Besides, we propose an end-to-end video text spotting framework with Transformer, termed TransVTSpotter, which solves the multi-orient text spotting in video with a simple, but efficient attention-based query-key mechanism. It applies object features from the previous frame as a tracking query for the current frame and introduces a rotation angle prediction to fit the multi-orient text instance. On ICDAR2015(video), TransVTSpotter achieves state-of-the-art performance with 44.2% MOTA, 13 fps. The dataset and code of TransVTSpotter can be found at https://github.com/weijiawu/BOVText-Benchmark and https://github.com/weijiawu/TransVTSpotter, respectively.

URL: https://github.com/weijiawu/BOVText-Benchmark for the benchmark(BOVText), and https://github.com/weijiawu/TransVTSpotter for the proposed method(TransVTSpotter).

Supplementary Material: pdf

Contribution Process Agreement: Yes

Dataset Url: the URL of the dataset: https://github.com/weijiawu/BOVText-Benchmark

License: The released video dataset includes two parts: 1,494 videos from KuaiShou and 356 videos from YouTube. For those videos from KuaiShou, we mask the private information such as the human face, which has passed the examination of the legal department and copyright department of KuaiShou corporation. Thus, we own the copyright for these videos. For those videos from YouTube, to the best of our knowledge at the time of download, we have exercised caution to download only those videos that were available on YouTube with a Creative Commmons CC-BY (v3.0) License. We don't own the copyright of those videos and provide them for non-commercial research purposes only. All data in our project is open source under CC-by 4.0 license and only be used for research purposes.

Author Statement: Yes

Community Implementations: [![CatalyzeX](/images/catalyzex_icon.svg) 6 code implementations](https://www.catalyzex.com/paper/a-bilingual-openworld-video-text-dataset-and/code)

15 Replies

Loading