A Bilingual, Open World Video Text Dataset and Real-Time Video Text Spotting With Contrastive Learning

Published: 01 Jan 2025 · Last Modified: 12 Apr 2025 · IEEE Trans. Circuits Syst. Video Technol. 2025 · CC BY-SA 4.0
Abstract: Most existing video text spotting benchmarks focus on evaluating a single language and scenario with limited data. In this work, we introduce a large-scale, Bilingual, Open World Video text benchmark dataset (BOVText). BOVText has four features. First, we provide 2,021 videos with more than 1,750,000 frames, 25 times larger than the existing largest dataset with incidental text in videos. Second, our dataset covers 32 open scenarios, including many virtual scenarios, e.g., Life Vlog, Driving, Movie, Game, etc. Third, abundant text type annotations (i.e., title, caption, or scene text) are provided for the different representational meanings in the video. Fourth, BOVText provides bilingual text annotations to promote communication across multiple cultures. In addition, we propose a real-time, end-to-end video text spotting method with Contrastive Learning of Semantic and Visual Representation (CoText), which offers two advantages: 1) with a lightweight architecture, CoText simultaneously addresses the three tasks (i.e., text detection, tracking, and recognition) in a real-time, end-to-end trainable framework; 2) CoText tracks texts by comprehending them and relating them to each other through visual and semantic representations. Extensive experiments show the superiority of our method. In particular, CoText achieves a video text spotting $\mathrm{ID_{F1}}$ of 71.7% at 32.3 FPS on ICDAR2015video, improving on the previous best method by 10.2% and 23.3 FPS. The dataset and code of CoText can be found at: Dataset and CoText, respectively.
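The abstract only sketches how CoText relates text instances across frames via joint visual and semantic representations. The minimal Python sketch below illustrates one plausible reading of that association step: fuse per-instance visual and semantic embeddings, then match instances between consecutive frames by cosine similarity. All function names, shapes, and the matching threshold are illustrative assumptions, not CoText's actual implementation; in the paper the embeddings would come from contrastively trained encoders.

```python
# Illustrative sketch only: associate text instances across frames by
# cosine similarity of fused visual + semantic embeddings. All names and
# shapes are hypothetical, not CoText's real code.
import numpy as np
from scipy.optimize import linear_sum_assignment


def fuse(visual: np.ndarray, semantic: np.ndarray) -> np.ndarray:
    """Concatenate and L2-normalize per-instance visual and semantic features.

    visual:   (N, Dv) appearance embeddings of N detected text instances
    semantic: (N, Ds) recognition/semantic embeddings of the same instances
    """
    fused = np.concatenate([visual, semantic], axis=1)
    norms = np.linalg.norm(fused, axis=1, keepdims=True)
    return fused / np.clip(norms, 1e-8, None)


def associate(prev: np.ndarray, curr: np.ndarray, sim_thresh: float = 0.5):
    """Match current-frame instances to previous-frame tracks.

    Returns (prev_idx, curr_idx) pairs whose cosine similarity exceeds
    sim_thresh; unmatched current instances would start new tracks.
    """
    sim = prev @ curr.T                      # rows are unit-norm, so this is cosine similarity
    row, col = linear_sum_assignment(-sim)   # Hungarian matching, maximizing similarity
    keep = sim[row, col] > sim_thresh
    return row[keep], col[keep]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    prev = fuse(rng.normal(size=(3, 64)), rng.normal(size=(3, 32)))
    # Slightly perturbed copies simulate the same texts seen in the next frame.
    curr = prev + 0.05 * rng.normal(size=prev.shape)
    curr /= np.linalg.norm(curr, axis=1, keepdims=True)
    print(associate(prev, curr))  # expect identity matching: ([0 1 2], [0 1 2])
```

Fusing the two modalities is what lets tracking survive appearance changes (blur, occlusion) when the recognized text stays stable, and vice versa; the contrastive objective would pull embeddings of the same text instance together across frames and push different instances apart.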