Keywords: video text spotting, text detection and recognition, video understanding with text
Abstract: Video text spotting is crucial for numerous real-world applications, but most existing video text reading benchmarks are poorly suited to evaluating advanced deep learning algorithms because they offer limited training data and a narrow range of scenarios. To address this issue, we introduce Multidimensional Multilingual Video Text (MMVText), the first large-scale, multilingual benchmark for video text spotting across a variety of scenarios. MMVText has three main features. First, it provides 510 videos with more than 1,000,000 frames, four times larger than the largest existing dataset for text in videos. Second, it covers 30 open categories spanning a wide selection of scenarios, such as life vlogs, sports news, autonomous driving, and cartoons. In addition, caption text and scene text are tagged separately because they carry different meanings in a video: the former conveys more theme information, while the latter conveys scene information. Third, MMVText provides multilingual text annotations to promote communication across cultures. Finally, we provide comprehensive experimental results and analysis for text detection, recognition, tracking, and end-to-end spotting on MMVText. We also discuss the potential of using MMVText for other video-and-text research.
Supplementary Material: zip
URL: https://github.com/weijiawu/MMVText-Benchmark