Abstract: Image-text retrieval is a pivotal task in information retrieval, and its importance has grown with the rapid advances in Visual-Language Pretraining models. However, current benchmarks for evaluating these models are limited: BLIP2, for example, achieves near-perfect performance on existing benchmarks. In response, this paper advocates a more robust evaluation benchmark for image-text retrieval, one that has several essential characteristics. First, a comprehensive benchmark should cover a diverse range of tasks in both perception-based and cognition-based retrieval. To meet this need, we introduce ReCoS, a novel benchmark designed for cross-modal image-text retrieval in complex real-life scenarios. Unlike existing benchmarks, ReCoS comprises 12 retrieval tasks, with a particular focus on three cognition-based tasks, providing a more holistic assessment of model capabilities. Second, to ensure the novelty of the benchmark, we use original data sources rather than existing publicly available datasets, minimizing the risk of data leakage. Third, to balance real-world complexity against benchmark usability, ReCoS includes text descriptions that are neither so detailed that retrieval becomes trivial nor so sparse that retrieval becomes impossible. Our evaluation results highlight the challenges that existing methods face, especially in the cognition-based retrieval tasks of ReCoS, and underscore the need for new approaches to image-text retrieval in real-world scenarios.
Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Engagement] Multimedia Search and Recommendation, [Content] Multimodal Fusion, [Systems] Data Systems Management and Indexing
Relevance To Conference: The paper introduces a novel benchmark for image-text retrieval that aligns with real-world scenarios, bridging multimedia and multimodal processing. The ReCoS benchmark dataset addresses limitations of existing benchmarks: it covers a diverse range of retrieval tasks, with particular emphasis on cognition-based tasks, enabling a more comprehensive evaluation of model performance. This design reflects the intertwined relationship between media types and modalities and shows why understanding that relationship is crucial for effective image-text retrieval. Moreover, the dataset uses original data sources to mitigate the risk of data leakage and carefully balances the level of detail in its textual descriptions to create more realistic retrieval scenarios. The contribution advances research and development in image-text retrieval and highlights the importance of synergy between multimedia and multimodal processing, fostering progress and practical applications in the field.
Supplementary Material: zip
Submission Number: 5483