The CRITICAL Records Integrated Standardization Pipeline (CRISP): End-to-End Processing of Large-scale Multi-institutional OMOP CDM Data

Xiaolong Luo, Michael Lingzhi Li

Published: 27 Nov 2025, Last Modified: 09 Dec 2025
Venue: ML4H 2025 Spotlight
License: CC BY 4.0
Keywords: Electronic Health Records, Multi-institutional Data, Critical Care, Data Processing Pipeline, Healthcare AI, Real-world Data
TL;DR: We present CRISP, an open-source, scalable pipeline that transforms raw multi-institutional OMOP CDM data into ML-ready datasets, enabling reproducible benchmarks on the 1.95B-record CRITICAL dataset and accelerating clinical AI research.
Track: Proceedings
Abstract: Large-scale critical care datasets have driven major progress in clinical AI, yet most remain limited to single institutions. The newly released CRITICAL dataset expands this scope, linking 1.95 billion records from 371,365 patients across four CTSA sites and capturing longitudinal patient journeys from pre-ICU to post-ICU care. Its scale and diversity enable more generalizable modeling but introduce significant challenges in data cleaning, vocabulary harmonization, and computational efficiency. We introduce **CRISP** (*CRITICAL Records Integrated Standardization Pipeline*), a scalable framework that transforms the raw CRITICAL resource into machine-learning–ready form. CRISP performs systematic data validation, cross-vocabulary mapping, and unit standardization while maintaining full auditability. Through parallelized optimization, it processes the entire dataset in under a day on standard computing hardware. The pipeline also provides reproducible baselines across multiple clinical prediction tasks, substantially reducing data preparation time and enabling consistent, multi-institutional evaluation. All code, documentation, and benchmarks are publicly available to support transparent and scalable clinical AI research.
General Area: Impact and Society
Specific Subject Areas: Dataset Release & Characterization
Data And Code Availability: Yes
Ethics Board Approval: Yes
Entered Conflicts: I confirm the above
Anonymity: I confirm the above
Submission Number: 242