Unifying Heterogeneous Medical Images Using Large Language Models
Abstract: Reuse of medical image datasets depends on consistent metadata, but inconsistencies between sources hinder interoperability. In the Human Radiome Project, we are collecting data from more than 1,000 sources into a database to support AI training for a medical foundation model. Manual curation of such heterogeneous data is not feasible at this scale, so we use large language models (LLMs) to automate metadata extraction and unification. Using a scalable, two-step LLM-based pipeline, we extract unstructured metadata and standardize it into a unified schema. Our approach improves metadata consistency, enabling more robust and interoperable datasets for downstream AI applications.
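To make the two-step idea concrete, the following is a minimal sketch of an extract-then-standardize flow, not the project's actual implementation. Everything named here is an assumption for illustration: the schema fields in `UNIFIED_FIELDS`, the synonym table `CANONICAL`, the prompt wording, and the `fake_llm` stand-in that replaces a real model call. The second step is simplified to a lookup table; the pipeline described in the abstract may use an LLM for that step as well.

```python
import json
from typing import Callable, Dict, Optional

# Hypothetical unified schema; the field names are illustrative only.
UNIFIED_FIELDS = ("modality", "body_part", "patient_sex")

# Hypothetical synonym table used by the standardization step.
CANONICAL = {
    "modality": {"mri": "MR", "mr": "MR", "ct": "CT", "computed tomography": "CT"},
    "patient_sex": {"m": "M", "male": "M", "f": "F", "female": "F"},
}


def extract(raw_text: str, llm: Callable[[str], str]) -> Dict[str, str]:
    """Step 1: ask the model for the schema fields as a JSON object."""
    prompt = (
        "Extract the following fields from the dataset description and "
        f"answer with a JSON object only: {', '.join(UNIFIED_FIELDS)}.\n\n"
        f"{raw_text}"
    )
    try:
        parsed = json.loads(llm(prompt))
    except json.JSONDecodeError:
        return {}  # malformed model output yields an empty record
    if not isinstance(parsed, dict):
        return {}
    # Keep only known fields; drop anything else the model returned.
    return {
        key: str(value)
        for key, value in parsed.items()
        if key in UNIFIED_FIELDS and value is not None
    }


def standardize(record: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Step 2: map extracted values onto canonical terms of the schema."""
    unified: Dict[str, Optional[str]] = {}
    for field in UNIFIED_FIELDS:
        value = record.get(field)
        if value is None:
            unified[field] = None  # missing fields stay explicit
            continue
        cleaned = value.strip()
        # Fall back to the cleaned value when no canonical term is known.
        unified[field] = CANONICAL.get(field, {}).get(cleaned.lower(), cleaned)
    return unified


def unify(raw_text: str, llm: Callable[[str], str]) -> Dict[str, Optional[str]]:
    """Run both steps on one unstructured metadata description."""
    return standardize(extract(raw_text, llm))


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call; returns a fixed JSON answer."""
    return '{"modality": "mri", "body_part": "Brain ", "patient_sex": "female", "extra": 1}'


if __name__ == "__main__":
    print(unify("T1-weighted brain MRI of female patients", fake_llm))
```

Passing the model as a plain callable keeps the sketch independent of any particular LLM client, and always returning every schema field (with `None` for gaps) keeps records from different sources directly comparable.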
External IDs: doi:10.5281/zenodo.15480676