Abstract: Genomic variation inference is an important subproblem in whole-genome assembly. For the well-established human genome, the reference provides additional information that allows data representations to depart from assembly graphs. One recent method, which uses a convolutional neural network (CNN) and an image-based representation, outperformed other algorithms in resolving single nucleotide variations (SNVs). The same representation was used to call deletion structural variations (SVs) but could not be applied to other SV types such as inversions. We present a variable-topology method along with a novel data representation suitable for a wide range of genomic variation inference tasks. We demonstrate the effectiveness of this representation for SVs by training CNN ensembles on tensors derived from the 1000 Genomes phase 3 high-coverage dataset to detect deletions and inversions, and by comparing the results to well-established methods. Our pure machine learning (ML) method opens a new direction in SV inference, in which feature selection and region filtering are no longer needed to maintain low false-positive rates.
External IDs: dblp:conf/bibm/BeckerS20