Keywords: Multimodal, Contrastive Learning, Medical
Abstract: Compound figures, which are multi-panel composites containing diverse subfigures, are ubiquitous in biomedical literature, yet large-scale subfigure extraction remains largely unaddressed. Prior work on subfigure extraction has been limited in both dataset size and generalizability, leaving a critical open question: How does high-fidelity image–text alignment via large-scale subfigure extraction impact representation learning in vision-language models? We address this gap by introducing a scalable subfigure extraction pipeline based on transformer-based object detection, trained on a synthetic corpus of 500,000 compound figures, that achieves state-of-the-art performance on both the ImageCLEF 2016 benchmark and synthetic benchmarks. Using this pipeline, we release Open-PMC-18M, a large-scale, high-quality biomedical vision-language dataset comprising 18 million clinically relevant subfigure–caption pairs spanning radiology, microscopy, and visible light photography. We train and evaluate vision-language models on our curated datasets and show improved performance across retrieval, zero-shot classification, and robustness benchmarks, outperforming existing baselines. We release our dataset, models, and code to support reproducible benchmarks and further study of biomedical vision-language modeling and representation learning.
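To make the detection step concrete, below is a minimal sketch of transformer-based subfigure detection on a compound figure. It uses an off-the-shelf DETR checkpoint (facebook/detr-resnet-50) from the Hugging Face transformers library as a stand-in; the paper's actual detector architecture, training weights, and the input file name `compound_figure.png` are assumptions, not the released pipeline.

```python
# Sketch only: detect candidate subfigure regions in a compound figure with DETR,
# then crop each detected box. The real pipeline's model/weights are not shown here.
import torch
from PIL import Image
from transformers import DetrImageProcessor, DetrForObjectDetection

processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")
model.eval()

compound = Image.open("compound_figure.png").convert("RGB")  # hypothetical input
inputs = processor(images=compound, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw predictions to thresholded (x0, y0, x1, y1) pixel boxes.
target_sizes = torch.tensor([compound.size[::-1]])  # (height, width)
detections = processor.post_process_object_detection(
    outputs, threshold=0.7, target_sizes=target_sizes
)[0]

# Crop each detected region as a candidate subfigure for downstream pairing.
subfigures = [compound.crop(tuple(box.tolist())) for box in detections["boxes"]]
```

In practice a detector like this would be fine-tuned on the synthetic compound-figure corpus described in the abstract rather than used with generic COCO weights.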
Croissant File: json
Dataset URL: https://huggingface.co/datasets/vector-institute/open-pmc-18m
Code URL: https://anonymous.4open.science/r/open-pmc-18m-CE25
Supplementary Material: zip
Primary Area: AI/ML Datasets & Benchmarks for health sciences (e.g. climate, health, life sciences, physics, social sciences)
Submission Number: 1478
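For readers who want to inspect the release at the Dataset URL above, here is a hypothetical loading snippet. It assumes the repository is packaged in a format the Hugging Face datasets library can stream; the split name and record fields are assumptions about the release, not documented facts.

```python
# Hypothetical usage: stream the released pairs without downloading all 18M examples.
from datasets import load_dataset

ds = load_dataset("vector-institute/open-pmc-18m", split="train", streaming=True)
sample = next(iter(ds))  # field names (e.g. image/caption) depend on the release
print(sample.keys())
```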