Abstract: Highlights
• The paper proposes a large cross-modal video retrieval dataset with text reading comprehension, named TextVR, which includes 42.2k sentence queries for 10.5k videos across 8 scenario domains. Unlike previous benchmarks, TextVR requires models to retrieve videos using both the semantic information from text/OCR tokens and the visual context simultaneously.
• For the cross-modal video retrieval task on TextVR, we present a thorough experimental analysis (e.g., a dedicated evaluation of reading comprehension ability), along with new insights and new challenges (e.g., the negative impact of irrelevant and noisy text/OCR tokens).
• A Scene Text Aware Video Retrieval baseline, StarVR, which fuses the semantics from reading comprehension with the visual representation, is provided for the new task. Experiments show that current state-of-the-art cross-modal video retrieval methods fail on TextVR, while our StarVR, with its scene text semantic representation, achieves encouraging results.
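The highlights describe StarVR as fusing reading-comprehension (scene-text) semantics with visual representations before ranking videos against a query. As a rough illustration only, and not the paper's actual StarVR implementation, the sketch below shows one generic way such a late fusion could score query-video pairs; the function names, fusion weight `alpha`, and embedding dimensions are hypothetical assumptions.

```python
# Hypothetical sketch of late-fusing visual and OCR (scene-text) embeddings for
# text-to-video retrieval. Encoder outputs are replaced by random placeholders;
# `alpha` and all shapes are illustrative, not taken from the paper.
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def fuse_video_features(visual_feat, ocr_feat, alpha=0.5):
    """Weighted late fusion of a video's visual embedding and its aggregated OCR embedding."""
    return l2_normalize(alpha * visual_feat + (1.0 - alpha) * ocr_feat)


def rank_videos(query_emb, visual_feats, ocr_feats, alpha=0.5):
    """Rank videos by cosine similarity between the query and the fused video embeddings."""
    query_emb = l2_normalize(query_emb)
    fused = fuse_video_features(visual_feats, ocr_feats, alpha)  # (num_videos, dim)
    scores = fused @ query_emb                                   # cosine similarity per video
    return np.argsort(-scores), scores


# Toy usage: 3 videos with 4-dim embeddings standing in for real encoder outputs.
rng = np.random.default_rng(0)
query = rng.normal(size=4)
visual = rng.normal(size=(3, 4))
ocr = rng.normal(size=(3, 4))
order, scores = rank_videos(query, visual, ocr)
print("retrieval order:", order)
```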