MLBF-PRS: A MACHINE LEARNING MODEL DEVELOPMENT AND BENCHMARKING FRAMEWORK FOR POLYGENIC RISK SCORES
Keywords: Machine Learning, PRS, PGS, Benchmarks, Nextflow, Pipeline, Polygenic score, Polygenic risk score
Abstract: In contrast to other genomic tasks, the development of machine learning-based, individual-level, genome-wide predictive models, typically termed polygenic risk scores (PRS), has shown little improvement from the use of complex machine learning (ML) methods. This disparity can be attributed to challenges in accessibility and comparability across studies, and to a lack of development and evaluation guidelines that enable reproducibility. Sequence-based genomic tasks, by contrast, benefit from benchmarks, which have proven fruitful in advancing machine learning model development across domains.
To overcome these challenges in the development of ML-based PRS models, we introduce MLBF-PRS, a novel framework intended to promote and accelerate the development of ML-based solutions. The framework provides: flexible Nextflow DSL2 pipelines that enable parallel comparison of ML models (SVMs, random forests, neural networks) against established statistical PRS methods; comprehensive quality control and data preparation modules that follow PRS-specific best practices; and automated tracking of model parameters, trained weights, and configurations to ensure full reproducibility.
We describe the usage of MLBF-PRS to showcase how the framework improves accessibility in a setting where the setup and evaluation of PRS models is typically time-consuming and requires navigating multiple software tools. Standardised, reproducible, dataset-specific benchmarking through MLBF-PRS offers a practical alternative to traditional open benchmarks. We make our framework openly available and continue to expand its capabilities.
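As an illustration of what dataset-specific benchmarking with tracked configurations might look like in its simplest form, the following is a minimal Python sketch. It is not the MLBF-PRS implementation or API: the function names (`additive_prs`, `auc`, `benchmark`), the toy genotype data, the effect weights, and the configuration keys are all hypothetical. It assumes a classical additive score (a weighted sum of allele dosages) as the statistical baseline and a constant scorer as a stand-in for any other model's prediction function; both are evaluated on the same held-out samples, and the run configuration is hashed so a result can be traced back to its settings.

```python
import hashlib
import json


def additive_prs(dosages, weights):
    """Classical additive score: sum over variants of effect weight x allele dosage."""
    return sum(w * d for w, d in zip(weights, dosages))


def auc(scores, labels):
    """Rank-based AUC: share of case/control pairs ranked correctly (ties count 0.5)."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if c > k else 0.5 if c == k else 0.0
        for c in cases
        for k in controls
    )
    return wins / (len(cases) * len(controls))


def benchmark(models, X_test, y_test, config):
    """Score every model on the same held-out data and attach the run configuration."""
    # Hash the sorted configuration so identical settings yield the same identifier.
    config_id = hashlib.sha256(
        json.dumps(config, sort_keys=True).encode()
    ).hexdigest()[:12]
    results = {
        name: auc([predict(x) for x in X_test], y_test)
        for name, predict in models.items()
    }
    return {"config_id": config_id, "config": config, "auc": results}


# Hypothetical toy data: rows are individuals, columns are allele dosages (0, 1, or 2).
X_test = [[0, 1, 2], [2, 2, 1], [0, 0, 1], [1, 2, 2]]
y_test = [0, 1, 0, 1]
weights = [0.3, 0.5, 0.1]  # hypothetical per-variant effect sizes

models = {
    "additive_prs": lambda x: additive_prs(x, weights),
    # Stand-in for any other model's predict function (assumption for illustration).
    "constant_baseline": lambda x: 0.0,
}

report = benchmark(models, X_test, y_test, {"seed": 42, "split": "holdout"})
print(json.dumps(report, indent=2))
```

Any model that exposes a per-individual prediction function could be added to the `models` dictionary and would then be compared on identical data under an identical, recorded configuration.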
Supplementary Material: zip
Primary Area: infrastructure, software libraries, hardware, systems, etc.
Submission Number: 24789