Unifying Heterogeneous Medical Images Using Large Language Models

Published: 14 May 2025 · Last Modified: 04 May 2026 · Zenodo · CC BY-SA 4.0
Abstract: Reuse of medical image datasets depends on consistent metadata, but inconsistencies between sources hinder interoperability. In the Human Radiome Project, we are collecting data from over 1000 sources into a single database to support AI training for a medical foundation model. Manual curation of such heterogeneous data is not feasible at scale, so we leverage large language models (LLMs) to automate metadata extraction and unification. Using a scalable, two-step LLM-based pipeline, we extract unstructured metadata and standardize it into a unified schema. Our approach improves metadata consistency, enabling more robust and interoperable datasets for downstream AI applications.
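The two-step pipeline described above could be sketched as follows. This is a minimal illustration under stated assumptions: the function names, schema fields, prompts, and the stubbed LLM responses are all hypothetical placeholders, not the project's actual implementation or prompts.

```python
import json

# Illustrative unified schema (assumed fields; the real schema is not given here).
UNIFIED_FIELDS = ["modality", "body_part", "patient_age"]

def extract_metadata(llm, raw_header: str) -> dict:
    """Step 1: prompt the LLM to pull key-value metadata out of unstructured text."""
    prompt = f"Extract modality, body part and patient age as JSON:\n{raw_header}"
    return json.loads(llm(prompt))

def standardize_metadata(llm, extracted: dict) -> dict:
    """Step 2: prompt the LLM to map the extracted values onto the unified schema."""
    prompt = f"Map these fields onto the schema {UNIFIED_FIELDS} as JSON:\n{json.dumps(extracted)}"
    return json.loads(llm(prompt))

def stub_llm(prompt: str) -> str:
    """Deterministic stand-in for a real LLM call, so the sketch runs offline."""
    if prompt.startswith("Extract"):
        return json.dumps({"Modality": "MR", "BodyPartExamined": "BRAIN", "PatientAge": "034Y"})
    return json.dumps({"modality": "MRI", "body_part": "brain", "patient_age": 34})

# Example: a DICOM-like header string with vendor-specific field names.
raw = "SeriesDescription: T1w MPRAGE; Modality=MR; BodyPartExamined=BRAIN; PatientAge=034Y"
unified = standardize_metadata(stub_llm, extract_metadata(stub_llm, raw))
print(unified)  # {'modality': 'MRI', 'body_part': 'brain', 'patient_age': 34}
```

Separating extraction from standardization lets each prompt stay narrow: the first step only has to read messy source text, while the second only has to map already-structured fields onto one target schema.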