- TL;DR: Development of a machine learning algorithm to prioritize disease-causing variants in the human genome. We hope this approach helps diagnose patients with undiagnosed rare diseases.
- Keywords: Genomic medicine, Bioinformatics, Machine Learning, Rare diseases.
- Abstract: The World Health Organization (WHO) estimates around 400 million cases of rare diseases worldwide, spread across 5000 to 8000 distinct diseases. Most of them have a genetic basis, which hinders medical diagnosis. These conditions usually occur as a consequence of low-frequency single nucleotide polymorphisms (SNPs). For this reason, genomic studies have had a great impact on accelerating diagnosis. However, common bioinformatics analyses require an unfeasible number of working hours to cover the whole human genome. Our research uses a machine learning approach to address this problem. In silico analysis aims to classify variants into five groups: benign, likely benign, uncertain significance, likely pathogenic and pathogenic. For coding regions of the genome, which represent less than 2% of total DNA, the typical approach works well. Yet there is no established protocol for identifying disease-causing SNPs in long non-coding regions. If successful, our approach will help prioritize variants from patients suffering from undiagnosed rare diseases. To do so, we make use of manually curated public databases. In these databases, researchers submit variants, specifying for each one its chromosome, position, reference allele and alternate allele, and tagging it with one of the clinical significances listed above. This information is used to annotate each variant with a number of different scores based on its characteristics. Most of these scores rely on amino acid properties to estimate pathogenicity, which makes non-coding variants hard to characterise. Therefore, a particularly precise curation of the dataset is required to avoid biases in the biological properties that determine the pathogenicity of non-coding SNPs. Given our data, the classification process will be divided into two parallel workflows, each involving two steps. First, because their features differ, non-coding variants will be separated from protein-coding variants, and both datasets will then be processed in parallel. Within each workflow, a categorical classifier will first determine whether each variant is of uncertain significance. Then, the variants with a defined clinical significance will be scored with a continuous algorithm in order to predict their “degree” of pathogenicity. Once the most adequate classifiers are determined, the same workflow (annotation + variant prioritization) will be applied to variants obtained from rare disease patients. We hope this approach will help us determine the genetic basis of these pathologies and so help diagnose patients in a delicate situation.
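
The routing described in the abstract (split coding from non-coding variants, filter out variants of uncertain significance, then score the rest on a continuous scale) can be sketched as follows. This is only an illustrative skeleton, not the classifiers used in this work: the `Variant` fields `is_coding` and `scores`, the `toy_score` annotation, and the two threshold rules standing in for trained models are all hypothetical placeholders.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Variant:
    # Fields named in the abstract: chromosome, location, reference and alternate allele.
    chromosome: str
    position: int
    ref: str
    alt: str
    # Hypothetical annotation outputs (assumed, not specified in the abstract).
    is_coding: bool = False
    scores: Dict[str, float] = field(default_factory=dict)


@dataclass
class Workflow:
    # Step 1: categorical decision, uncertain significance or not.
    is_uncertain: Callable[[Variant], bool]
    # Step 2: continuous "degree" of pathogenicity for the remaining variants.
    pathogenicity: Callable[[Variant], float]

    def run(self, variants: List[Variant]) -> List[Tuple[Variant, Optional[float]]]:
        # Variants flagged as uncertain get no score (None); the others are scored.
        return [
            (v, None if self.is_uncertain(v) else self.pathogenicity(v))
            for v in variants
        ]


def prioritize(variants: List[Variant], coding_wf: Workflow, noncoding_wf: Workflow):
    # Separate the two datasets, then run each through its own workflow.
    coding = [v for v in variants if v.is_coding]
    noncoding = [v for v in variants if not v.is_coding]
    results = coding_wf.run(coding) + noncoding_wf.run(noncoding)
    # Rank scored variants from most to least pathogenic; keep uncertain ones apart.
    ranked = sorted(
        [(v, s) for v, s in results if s is not None],
        key=lambda pair: pair[1],
        reverse=True,
    )
    uncertain = [v for v, s in results if s is None]
    return ranked, uncertain


# Placeholder "models": simple rules on a made-up score, standing in for trained classifiers.
def toy_is_uncertain(v: Variant) -> bool:
    return abs(v.scores["toy_score"] - 0.5) < 0.1


def toy_pathogenicity(v: Variant) -> float:
    return v.scores["toy_score"]


toy_workflow = Workflow(toy_is_uncertain, toy_pathogenicity)

# Made-up example variants, for illustration only.
example_variants = [
    Variant("1", 12345, "A", "G", is_coding=True, scores={"toy_score": 0.92}),
    Variant("7", 67890, "C", "T", is_coding=False, scores={"toy_score": 0.55}),
    Variant("X", 13579, "G", "A", is_coding=False, scores={"toy_score": 0.10}),
]

ranked, uncertain = prioritize(example_variants, toy_workflow, toy_workflow)
for variant, score in ranked:
    print(variant.chromosome, variant.position, score)
```

In practice each workflow would receive its own pair of trained models, since the abstract notes that coding and non-coding variants differ in their available features; the same `toy_workflow` is reused here only to keep the sketch short.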