From k-mers to Genomic Foundation Models: Benchmarking COX1 Taxonomy under Extreme Class Imbalance

Published: 02 Mar 2026, Last Modified: 08 May 2026, Venue: MLGenX 2026 Poster, License: CC BY 4.0
Abstract: Machine learning for genomics increasingly depends on pretrained genomic foundation models (gLMs) as reusable sequence encoders, yet adoption in biological discovery remains constrained by three linked challenges: tokenization mismatch with biological signal, domain shift between pretraining corpora and downstream assays, and extreme long-tail taxonomic labels that destabilize standard objectives. We study these issues in the ecologically central COI/COX1 gene through an alignment-free benchmark that converts nucleotide sequences into fixed-length embeddings (mean-pooled hidden states) and trains lightweight MLP classifiers for independent rank-wise prediction from Domain to Species. We evaluate two complementary regimes for scalable and interpretable genomic modeling: eKOI (15,947 sequences; protist-rich; 11,047 species) and MetaCOXI (5.6M metazoan sequences; 743,671 species). Across diverse gLM families (autoregressive decoders and masked-language encoders) and explicit compositional baselines (overlapping k-mer frequencies up to k=6), we find that effective motif length induced by tokenization is a dominant driver of fine-rank separability, while corpus alignment (eukaryote- vs. prokaryote-pretraining) materially affects transfer even under identical tokenization. Finally, imbalance-aware objectives (weighted cross-entropy and a hybrid weighted+contrastive loss) can stabilize rare-taxonomy performance but remain representation-dependent.
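As a rough illustration of two ingredients named in the abstract, the sketch below computes an overlapping k-mer frequency vector (the compositional baseline) and per-class weights of the kind a weighted cross-entropy could use. The function names, the skipping of windows with ambiguous bases, and the inverse-frequency weighting formula are assumptions made for illustration; the abstract does not specify them.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k):
    """Overlapping k-mer frequency vector over the A/C/G/T alphabet.

    The vector has 4**k entries in lexicographic order (AA.., AC.., ...).
    """
    seq = seq.upper()
    vocab = ["".join(p) for p in product("ACGT", repeat=k)]
    # Slide a window of width k with stride 1 (overlapping k-mers).
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Assumption: windows containing ambiguous bases (e.g. N) are ignored.
    total = sum(counts[w] for w in vocab)
    return [counts[w] / total if total else 0.0 for w in vocab]


def inverse_frequency_weights(labels):
    """Per-class weights n / (C * n_c), one common imbalance-aware choice.

    Assumption: the benchmark's actual weighting scheme may differ.
    """
    counts = Counter(labels)
    n, num_classes = len(labels), len(counts)
    return {label: n / (num_classes * m) for label, m in counts.items()}


if __name__ == "__main__":
    vec = kmer_frequencies("ACGTACGTNNAC", k=2)
    print(len(vec), round(sum(vec), 6))
    print(inverse_frequency_weights(["sp1", "sp1", "sp1", "sp2"]))
```

With k=6 the vector has 4^6 = 4,096 entries, so it can feed the same kind of lightweight MLP classifier as a mean-pooled gLM embedding.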
Submission Number: 75