Abstract: The comparison of genomic sequences is an important undertaking, for example in phylogenetic and differential sequence analyses. In this work we describe three filter designs that can be applied to genomic power spectra (PS) of any length to reduce their size while maintaining the relative distances they provide, and that are relevant for data reduction, sorting, and correlation studies of an ensemble of sequences. Specifically, we present: Minimal Variance Filtering (MVF), where the subset of coefficients with the highest variance across a sample is selected; Automated Filter Learning (AFL), where a set of linear combinational filters is learned automatically by a 1-D deep convolutional neural network trained to classify sequences by region of origin; and Maximal Variance Principal Components Filters (MVPCF), which provide a set of filters from the principal component loadings determined among the highest-variance elements of the PS for a sample. We compare these approaches by examining how well they conserve the distances produced by the entire PS, and conclude with remarks on the benefits and drawbacks of each method and on future avenues for this research.
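As a rough illustration of the variance-based selection described for MVF (keeping the PS coefficients with the highest variance across a sample), the following minimal Python sketch may help. It is not the authors' implementation. The toy sequences, the equal sequence lengths, the indicator-sequence encoding used to compute the power spectrum, the Euclidean distance, and all function names are assumptions made purely for illustration.

```python
import cmath
from statistics import pvariance


def power_spectrum(seq, n_freq):
    """Toy power spectrum (assumed encoding): for each frequency k, sum the
    squared DFT magnitudes of the four nucleotide indicator sequences."""
    n = len(seq)
    ps = []
    for k in range(1, n_freq + 1):  # skip k = 0, which only reflects base counts
        total = 0.0
        for base in "ACGT":
            coeff = sum(
                cmath.exp(-2j * cmath.pi * k * t / n)
                for t, ch in enumerate(seq)
                if ch == base
            )
            total += abs(coeff) ** 2
        ps.append(total)
    return ps


def top_variance_indices(spectra, keep):
    """Indices of the `keep` coefficients with the highest variance across the sample."""
    n_coeff = len(spectra[0])
    variances = [pvariance([ps[i] for ps in spectra]) for i in range(n_coeff)]
    ranked = sorted(range(n_coeff), key=lambda i: variances[i], reverse=True)
    return sorted(ranked[:keep])


def apply_filter(ps, indices):
    """Reduce a power spectrum to the selected coefficients."""
    return [ps[i] for i in indices]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


# Hypothetical toy sample; real genomic sequences would be far longer.
sequences = [
    "ACGTACGTACGTACGT",
    "ACGTACGAACGTACGT",
    "GGGTTTCCCAAAGGGT",
    "ATATATATCGCGCGCG",
]
spectra = [power_spectrum(s, n_freq=8) for s in sequences]
selected = top_variance_indices(spectra, keep=3)
filtered = [apply_filter(ps, selected) for ps in spectra]

# Compare pairwise distances before and after filtering.
for i in range(len(sequences)):
    for j in range(i + 1, len(sequences)):
        full = euclidean(spectra[i], spectra[j])
        reduced = euclidean(filtered[i], filtered[j])
        print(f"pair ({i}, {j}): full PS = {full:.2f}, filtered = {reduced:.2f}")
```

Because the filtered vectors keep only a subset of the coordinates, each filtered Euclidean distance is at most the corresponding full-PS distance. How closely the relative ordering of distances is preserved is the kind of question the comparison in this work addresses.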