Application of Med-PaLM 2 in the refinement of MIMIC-CXR labels

Kendall Park, Rory Sayres, Andrew Sellergren, Tom Pollard, Fayaz Jamil, Timo Kohlberger, Charles Lau, Atilla Kiraly

Published: 01 Jan 2025 · Last Modified: 04 Feb 2026 · PhysioNet · CC BY-SA 4.0
Abstract: MIMIC-CXR is a large, open-source dataset that is widely used in medical AI research. One limitation of this dataset is the lack of ground truth labels for the chest X-ray studies. Prior work has extracted structured labels from the MIMIC-CXR radiology report text using CheXpert, a natural language processing (NLP) model. As comprehensive expert validation of these labels is cost-prohibitive, there is a need for scalable methods of identifying NLP-derived labels that would benefit from manual review. We have developed prompts for extraction of clinically relevant labels using a clinically trained large language model, Med-PaLM 2, which we selectively applied to MIMIC-CXR radiology reports. A subset of cases where the Med-PaLM 2 results differed from the previously published CheXpert labels was reviewed by three US board-certified radiologists to establish a ground truth. On these differing labels, Med-PaLM 2 achieved an accuracy of 66%, compared to 19% for CheXpert. Our results demonstrate the potential of medically oriented large language models such as Med-PaLM 2 both for label extraction and for identifying cases for manual review. This dataset contributes 1,378 radiologist-verified ground truth labels to the MIMIC-CXR project.
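The disagreement-flagging step described in the abstract can be sketched as a simple label comparison: studies where the Med-PaLM 2 extraction differs from the CheXpert label become candidates for radiologist review. The study IDs, label names, and dictionary layout below are illustrative assumptions, not the project's actual schema.

```python
# Hypothetical sketch: flag studies where two labelers disagree.
# Label values follow a simple present (1) / absent (0) convention
# for illustration only.

chexpert = {"s1": {"Pneumonia": 1, "Edema": 0},
            "s2": {"Pneumonia": 0, "Edema": 1}}
med_palm2 = {"s1": {"Pneumonia": 1, "Edema": 1},
             "s2": {"Pneumonia": 0, "Edema": 1}}

def flag_disagreements(a, b):
    """Return (study_id, label) pairs where the two labelers differ."""
    flags = []
    for study in a:
        for label, value in a[study].items():
            # Missing studies or labels in b also count as disagreements.
            if b.get(study, {}).get(label) != value:
                flags.append((study, label))
    return flags

print(flag_disagreements(chexpert, med_palm2))  # → [('s1', 'Edema')]
```

Only the flagged subset would then go to manual review, which is what makes the approach scalable relative to validating every label.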