Reversed in Time: A Novel Temporal-Emphasized Benchmark for Cross-Modal Video-Text Retrieval

Published: 20 Jul 2024, Last Modified: 21 Jul 2024 · MM 2024 Poster · CC BY 4.0
Abstract: Video-text retrieval is an important task in the multimodal understanding field. Temporal understanding makes video-text retrieval more challenging than image-text retrieval. However, we find that widely used video-text benchmarks have shortcomings in assessing models' retrieval ability, especially temporal understanding, such that large-scale image-text pre-trained models can already achieve zero-shot performance comparable to video-text pre-trained models. In this paper, we introduce RTime, a novel temporal-emphasized video-text retrieval dataset, constructed through a top-down three-step process. We first obtain videos of actions or events with significant temporality, and then reverse these videos to create harder negative samples. We recruit annotators to judge the significance and reversibility of candidate videos, and to write captions for qualified videos. We further adopt GPT-4 to extend more captions based on the human-written captions. Our RTime dataset currently consists of 21k videos with 10 captions per video, totalling about 122 hours. Based on RTime, we propose three retrieval benchmark tasks: RTime-Origin, RTime-Hard, and RTime-Binary. We further enforce leveraging harder negatives in model training, and benchmark a variety of video-text models on RTime. Extensive experimental analysis proves that RTime indeed poses new and higher challenges to video-text retrieval. We will release our RTime benchmarks to further advance video-text retrieval and multimodal understanding research.
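To make the abstract's two core ideas concrete, below is a minimal sketch of (1) reversing a clip's frame order to obtain a temporally hard negative and (2) an InfoNCE-style text-to-video loss that treats each reversed clip as an extra negative. This is not the authors' implementation; the PyTorch tensors, shapes, and temperature value are assumptions for illustration, and the video/text encoders are omitted.

```python
# Sketch only: reversed-clip hard negatives in a contrastive retrieval loss.
import torch
import torch.nn.functional as F


def reverse_clip(frames: torch.Tensor) -> torch.Tensor:
    """Reverse a clip along its temporal axis. frames: (T, C, H, W)."""
    return torch.flip(frames, dims=[0])


def text_to_video_loss_with_reversed_negatives(
    text_emb: torch.Tensor,       # (B, D) caption embeddings
    video_emb: torch.Tensor,      # (B, D) embeddings of the original clips
    rev_video_emb: torch.Tensor,  # (B, D) embeddings of the reversed clips
    temperature: float = 0.05,    # assumed value, not from the paper
) -> torch.Tensor:
    """InfoNCE over the batch where reversed clips serve as additional hard negatives."""
    text_emb = F.normalize(text_emb, dim=-1)
    video_emb = F.normalize(video_emb, dim=-1)
    rev_video_emb = F.normalize(rev_video_emb, dim=-1)

    # Similarity of each caption to all original clips and to all reversed clips.
    sim_orig = text_emb @ video_emb.t() / temperature      # (B, B)
    sim_rev = text_emb @ rev_video_emb.t() / temperature    # (B, B)

    # The correct match is the i-th original clip; every reversed clip is a negative.
    logits = torch.cat([sim_orig, sim_rev], dim=1)           # (B, 2B)
    targets = torch.arange(text_emb.size(0), device=text_emb.device)
    return F.cross_entropy(logits, targets)
```

Under this setup, a model that ignores temporal order scores the original and reversed clips nearly identically and is penalized, which is the intuition behind the RTime-Hard and RTime-Binary tasks.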
Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Content] Vision and Language
Relevance To Conference: Multimedia data types are by their very nature complex and often involve intertwined instances of different kinds of information. We can leverage this multi-modal perspective to extract meaning and understanding of the world, often with surprising results. Research has driven the merging of vision and language in different ways, for example captioning, question answering, multi-modal chatbots, and multi-modal retrieval. The vision-language area seeks new solutions and results that are specific to the problems of combining or bridging vision and language. Video-text retrieval is an important task in the vision-language understanding field. This work addresses the lack of temporal-understanding evaluation in existing video-text retrieval benchmarks. We introduce RTime, a novel fine-grained temporal-emphasized video-text dataset, carefully constructed in a top-down three-step pipeline by leveraging the power of large language models and human expertise. We further establish three benchmark tasks: RTime-Origin retrieval, RTime-Hard retrieval, and RTime-Binary retrieval, which support comprehensive and faithful evaluation of video understanding capabilities, especially temporal understanding. Extensive experimental analysis confirms that RTime indeed poses higher challenges to video-text retrieval.
Supplementary Material: zip
Submission Number: 1688