Abstract: While traditional methods for calling variants across whole genome sequence data rely on alignment to an appropriate reference sequence, alternative techniques are needed when a suitable reference does not exist. We present a novel alignment- and assembly-free variant calling method based on information-theoretic principles, designed to detect variants that have strong statistical evidence for their ability to segregate samples in a given dataset. Our method uses the context surrounding a particular nucleotide to define variants. Given a set of reads, we model the probability of observing a given nucleotide, conditioned on the surrounding prefix and suffix of length k, as a multinomial distribution. We then estimate which of these contexts are stable intra-sample and varying inter-sample using a statistic based on the Kullback–Leibler divergence.

The utility of the variant calling method was evaluated through analysis of a pair of bacterial datasets and a mouse dataset. We found that our variants are highly informative for supervised learning tasks, with performance similar to standard reference-based calls and another reference-free method (DiscoSNP++). Comparisons against reference-based calls showed our method was able to capture very similar population structure on the bacterial dataset. The algorithm's focus on discriminatory variants makes it suitable for many common analysis tasks for organisms that are too diverse to be mapped back to a single reference sequence.
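To make the context model concrete, the following is a minimal Python sketch of the idea described above, not the authors' implementation. It counts the middle nucleotide for each (prefix, suffix) context of length k, estimates a smoothed multinomial per context, and compares two samples with a plain Kullback–Leibler divergence. The function names, the additive pseudocount, and the use of the raw divergence as the score are all assumptions made for illustration; the abstract does not specify the actual statistic or how intra-sample stability is assessed.

```python
from collections import Counter, defaultdict
from math import log

ALPHABET = "ACGT"


def context_counts(reads, k):
    """Count the middle nucleotide seen in each (prefix, suffix) context of length k."""
    counts = defaultdict(Counter)
    for read in reads:
        for i in range(k, len(read) - k):
            window = read[i - k:i + k + 1]
            # Skip windows containing N or any other non-ACGT symbol.
            if set(window) <= set(ALPHABET):
                context = (read[i - k:i], read[i + 1:i + k + 1])
                counts[context][read[i]] += 1
    return counts


def multinomial_estimate(counter, pseudocount=1.0):
    """Smoothed multinomial over A, C, G, T (the pseudocount is an assumption)."""
    total = sum(counter[b] for b in ALPHABET) + pseudocount * len(ALPHABET)
    return [(counter[b] + pseudocount) / total for b in ALPHABET]


def kl_divergence(p, q):
    """KL(p || q) in nats; smoothing keeps every entry strictly positive."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q))


def context_divergences(reads_a, reads_b, k):
    """Score each context observed in both samples by KL divergence (illustrative only)."""
    counts_a = context_counts(reads_a, k)
    counts_b = context_counts(reads_b, k)
    shared = counts_a.keys() & counts_b.keys()
    return {
        ctx: kl_divergence(
            multinomial_estimate(counts_a[ctx]),
            multinomial_estimate(counts_b[ctx]),
        )
        for ctx in shared
    }
```

Calling `context_divergences(sample_a_reads, sample_b_reads, k)` returns a dictionary keyed by context; contexts with large values are those whose middle-nucleotide distribution differs most between the two samples. A fuller treatment would also need to handle reverse complements, sequencing error, and more than two samples, none of which this sketch attempts.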