HighClass: Efficient Metagenomic Classification via Quality-Aware Token Mapping and Sparsified Indexing
Keywords: Metagenomics, Taxonomic Classification, Computational Biology, Bioinformatics, Sequence Analysis, Quality-Aware Tokenization, High-Throughput Sequencing
Abstract: Metagenomic classification requires both high accuracy and computational efficiency to process the exponentially growing volume of sequencing data. We present *HighClass*, a classification framework built on three components: variable-length token indexing, quality-aware scoring, and learned sparsification.
Our key innovation replaces alignment operations with hash-based token mapping, achieving $O(|\mathcal{T}|)$ complexity while maintaining competitive accuracy. We establish rigorous theoretical foundations: (1) generalization bounds proving $O(\sqrt{V|\mathcal{Y}|/n})$ convergence for vocabulary size $V$ and $|\mathcal{Y}|$ taxa; (2) concentration inequalities under exponential $\alpha$-mixing with explicit dependency factors; (3) consistency guarantees for maximum likelihood classification under identifiability conditions.
HighClass achieves 85.1% F1 on CAMI II, within 1.5% of the state of the art, while delivering a $4.2\times$ speedup and a 68% memory reduction. Variable-length tokens provide a 6.8 percentage-point improvement over fixed k-mers through superior pattern capture. Quality-aware scoring with learned sensitivity $q_{\text{sens}} = 1.8$ optimally weights sequencing evidence. Gradient-based sparsification retains 32% of genomic regions while preserving 94% accuracy.
Beyond empirical gains, our work establishes the first comprehensive theory of token-based genomic classification, providing uniform convergence guarantees and an explicit characterization of dependency effects through $\alpha$-mixing analysis. These results move sequence classification from heuristic approaches to principled methods with provable guarantees.
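To make the idea of alignment-free, hash-based token mapping with quality-aware scoring concrete, the following is a minimal Python sketch. It is not the HighClass implementation, which the abstract does not specify. The greedy longest-match tokenizer, the `index` layout (token to per-taxon weight), and the weighting rule in `quality_weight` (mean per-base correctness probability raised to `q_sens`) are all assumptions chosen for illustration; only the value $q_{\text{sens}} = 1.8$ is taken from the abstract.

```python
from collections import defaultdict


def tokenize(read, vocab, max_len):
    """Greedy longest-match split of a read into variable-length tokens.

    Returns (start, token) pairs; positions where no vocabulary token
    starts are skipped. This tokenizer is a hypothetical stand-in.
    """
    out, i = [], 0
    while i < len(read):
        for n in range(min(max_len, len(read) - i), 0, -1):
            if read[i:i + n] in vocab:
                out.append((i, read[i:i + n]))
                i += n
                break
        else:
            i += 1  # no vocabulary token starts at this position
    return out


def quality_weight(phred, q_sens):
    """Assumed rule: mean base-call correctness probability, raised to q_sens."""
    p_correct = sum(1.0 - 10 ** (-q / 10) for q in phred) / len(phred)
    return p_correct ** q_sens


def classify(read, phred, index, max_len, q_sens=1.8):
    """Score taxa with one dictionary lookup per token; no alignment step.

    `index` maps token -> {taxon: weight}; `phred` holds one Phred score
    per base of `read`. Returns the best-scoring taxon, or None.
    """
    scores = defaultdict(float)
    for start, tok in tokenize(read, index, max_len):
        w = quality_weight(phred[start:start + len(tok)], q_sens)
        for taxon, weight in index[tok].items():
            scores[taxon] += w * weight
    return max(scores, key=scores.get) if scores else None
```

In this sketch the scoring loop performs a single hash lookup per token, which is the property the complexity claim rests on; the naive tokenizer shown here additionally tries up to `max_len` substrings per position. Low-quality bases shrink a token's contribution, so a token with a large index weight can still lose to a lower-weight token that was sequenced cleanly.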
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Submission Number: 21988