Abstract: Multi-modal information retrieval (MMIR) is a rapidly evolving field in which significant progress has been made through advanced representation learning and cross-modality alignment research, particularly in image-text pairing. However, current benchmarks for evaluating MMIR performance on image-text pairing overlook the scientific domain, which differs notably from generic data: the captions of scientific charts and tables usually describe analyses of experimental results or scientific principles, in contrast to the human activities or scenery depicted in generic images. To bridge this gap, we develop a \textbf{sci}entific domain-specific \textbf{MMIR} benchmark (\textbf{SciMMIR}) by leveraging open-access research paper corpora to extract data relevant to the scientific domain. This benchmark comprises \textbf{530K} meticulously curated image-text pairs, extracted from figures and tables with detailed captions in scientific documents. We further annotate the image-text pairs with a two-level subset-subcategory hierarchy to facilitate a more comprehensive evaluation of the baselines. We conduct zero-shot and fine-tuned evaluations on prominent multi-modal image-captioning and visual language models, such as CLIP, BLIP, and BLIP-2. Our findings offer critical insights for MMIR in the scientific domain, including the impact of pre-training and fine-tuning settings and the effects of different visual and textual encoders.
Paper Type: long
Research Area: Resources and Evaluation
Contribution Types: NLP engineering experiment, Data resources, Data analysis
Languages Studied: English