Genomic Databases Homogenization with Machine Learning

Published: 01 Jan 2023, Last Modified: 05 Feb 2025 · BIBM 2023 · CC BY-SA 4.0
Abstract: Large-scale and increasingly diverse datasets power modern genomic studies, yet robust data integration and homogenization across varying sources remain a challenge. The multiplicity of file formats and the computational requirements imposed by large genomic datasets make it difficult to work with multiple data sources. Furthermore, there is a lack of open-source, customizable tools that merge genomic databases while providing quality-control functionality. To fill this gap, we present MergeGenome, a machine learning-based method designed to integrate DNA sequences from multiple variant call format (VCF) files while maintaining data quality. By leveraging pre-existing VCF manipulation and imputation software, MergeGenome provides a robust pipeline of comprehensive steps to standardize nomenclature, remove ambiguities, correct strand alignment, eliminate mismatches, impute missing positions, and filter and correct erroneous variants with machine learning, among other functionalities. We demonstrate MergeGenome’s ability to obtain a high-quality combined dataset by merging two databases containing dog DNA and effectively detecting and correcting imputation errors. Finally, we show that using the homogenized dataset boosts phenotype prediction performance.
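To make the homogenization steps concrete, here is a minimal, hypothetical sketch (not MergeGenome's actual code) of one of the steps the abstract names: correcting strand alignment between two variant sources before merging. All function and variable names below are illustrative assumptions.

```python
# Illustrative sketch of strand-alignment correction when merging two
# variant sources. A variant whose alleles in source B are the reverse-
# strand complements of those in source A can be "flipped" to match;
# alleles that disagree even after complementing are mismatches to drop.

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complement(allele: str) -> str:
    """Return the opposite-strand complement of a single-base allele."""
    return COMPLEMENT[allele]

def harmonize_variant(ref_a: str, alt_a: str, ref_b: str, alt_b: str) -> str:
    """Classify one biallelic SNP shared by two sources.

    Returns:
      "ambiguous" -- A/T or C/G SNPs, where strand cannot be resolved
                     from alleles alone (typically removed)
      "match"     -- alleles already agree across sources
      "flip"      -- source B is on the opposite strand; complementing fixes it
      "mismatch"  -- alleles disagree even after complementing (drop)
    """
    if complement(ref_a) == alt_a:  # e.g. A/T or C/G: strand-ambiguous
        return "ambiguous"
    if (ref_a, alt_a) == (ref_b, alt_b):
        return "match"
    if (ref_a, alt_a) == (complement(ref_b), complement(alt_b)):
        return "flip"
    return "mismatch"

print(harmonize_variant("A", "G", "A", "G"))  # match
print(harmonize_variant("A", "G", "T", "C"))  # flip
print(harmonize_variant("A", "G", "A", "C"))  # mismatch
print(harmonize_variant("A", "T", "A", "T"))  # ambiguous
```

In a real pipeline this classification would run per site over the parsed VCF records, keeping matches, complementing the alleles (and swapping genotype encodings as needed) for flips, and discarding ambiguous sites and mismatches before imputation.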