Long COVID Challenge: Predictive Modeling of Noisy Clinical Tabular Data

Published: 01 Jan 2023, Last Modified: 20 Mar 2024. Venue: ICHI 2023. Readers: Everyone
Abstract: We present an end-to-end machine learning pipeline for aggregating, analyzing, and modeling National COVID Cohort Collaborative (N3C) data on the Enclave system as part of the NIH Long COVID Computational Challenge (L3C). The challenge’s goal is to estimate the probability that patients who have tested positive for SARS-CoV-2 in an outpatient or hospital setting (ICU or non-ICU) will develop PASC/Long COVID. To this end, we use state-of-the-art machine learning algorithms to process millions of clinical observations and identify the attributes that contribute most to accurate prediction. The pipeline is optimized for deployment on the N3C Enclave and aims to inform clinical decisions for managing and preventing PASC/Long COVID by identifying the most relevant factors. The study implements four state-of-the-art machine learning methods for noisy tabular data in PySpark on the Enclave, along with a novel robust cascaded fusion model. Results show improved modeling performance at high noise levels in clinical data sources, with the cascaded model yielding the highest number of true positives and the lowest number of true negatives. Multiple conditions, observations, and drugs relevant to Long COVID diagnosis and treatment were also identified.