Resolving multiple structural variation callers and platforms by matrix transformation and imputation
Abstract: Advances in genome sequencing technologies have improved the detection of structural variations (SVs), but the process remains imperfect because of inference limitations in both paired-end and long-read platforms. Beyond their higher sequencing cost, long-read platforms have been shown to have lower inference performance for certain structural variants than lower-cost paired-end reads. It is therefore important to establish an automated, information-driven methodology that combines the strengths of any given platform and SV calling algorithm to produce a high-quality SV estimate for a new, unstudied individual. We detail a novel method for clustering similar variants from an arbitrary number of callers and platforms by transforming them into a standardized matrix with imputed missing values. A particularly useful property of this formulation is that new, unseen variants can be represented in a shared space (e.g., a cloud or local database) alongside previously studied examples. This allows our method to be extended to online or semi-supervised learning, where gold-standard data sets derived from the 1000 Genomes Phase 2 and Phase 3 and the Human Genome SV project Phase 1 and Phase 2 can be used to eliminate background noise. We compare our ensemble method with leading individual callers and other ensemble methods and show an increase in performance. We showcase the importance of analysis transparency in identifying disease-specific SVs (i.e., those associated with orofacial cleft lip and palate) using selected samples from the Gabriella Miller Kids First Asian Orofacial Cleft cohort. Our work is written in Python 3 and is fully open source, and it is released along with several preprocessed supervised learning datasets.
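The abstract does not specify the matrix layout, the clustering rule, or the imputation scheme, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' implementation. It assumes that calls are grouped by a 50% reciprocal overlap within the same chromosome and SV type, that each caller contributes a start and an end column, and that missing entries are filled with the mean of the other callers' values for the same cluster. The toy calls, caller names, and function names are all hypothetical.

```python
import numpy as np

# Hypothetical toy calls: (caller, chrom, start, end, sv_type)
CALLS = [
    ("caller_a", "chr1", 1000, 2000, "DEL"),
    ("caller_b", "chr1", 1010, 1990, "DEL"),
    ("caller_c", "chr1", 5000, 5600, "DUP"),
    ("caller_a", "chr1", 5020, 5580, "DUP"),
    ("caller_b", "chr2", 300, 900, "DEL"),
]
CALLERS = ["caller_a", "caller_b", "caller_c"]


def reciprocal_overlap(a_start, a_end, b_start, b_end):
    """Fraction of overlap relative to the longer-covered interval's partner."""
    inter = min(a_end, b_end) - max(a_start, b_start)
    if inter <= 0:
        return 0.0
    return min(inter / (a_end - a_start), inter / (b_end - b_start))


def cluster_calls(calls, min_overlap=0.5):
    """Greedy grouping (assumed rule): same chrom and type, overlap >= threshold."""
    clusters = []
    for call in sorted(calls, key=lambda c: (c[1], c[4], c[2])):
        _, chrom, start, end, sv_type = call
        for cluster in clusters:
            _, c_chrom, c_start, c_end, c_type = cluster[0]
            same_kind = (chrom, sv_type) == (c_chrom, c_type)
            if same_kind and reciprocal_overlap(start, end, c_start, c_end) >= min_overlap:
                cluster.append(call)
                break
        else:
            clusters.append([call])
    return clusters


def build_matrix(clusters, callers):
    """One row per cluster; columns are (start, end) per caller, NaN if no call."""
    matrix = np.full((len(clusters), 2 * len(callers)), np.nan)
    for i, cluster in enumerate(clusters):
        for caller, _, start, end, _ in cluster:
            j = callers.index(caller)
            matrix[i, 2 * j] = start
            matrix[i, 2 * j + 1] = end
    return matrix


def impute_row_mean(matrix):
    """Fill missing starts/ends with the mean of the observed ones in that row."""
    filled = matrix.copy()
    for offset in (0, 1):  # 0 selects start columns, 1 selects end columns
        cols = filled[:, offset::2]  # strided view into `filled`
        row_means = np.nanmean(cols, axis=1, keepdims=True)
        mask = np.isnan(cols)
        cols[mask] = np.broadcast_to(row_means, cols.shape)[mask]
    return filled


clusters = cluster_calls(CALLS)
matrix = build_matrix(clusters, CALLERS)
filled = impute_row_mean(matrix)
print(filled)
```

Row-mean imputation is chosen here only because it is easy to follow; a real pipeline could substitute any imputation model. Once every row has the same fixed width, calls from a new sample can be placed in the same space as previously studied ones, which is the property the abstract highlights.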
External IDs: dblp:conf/bibm/BeckerLQCS24