Track: Full / long paper (5-8 pages)
Keywords: COI/COX1 genes, taxonomic classification, alignment-free, genomic foundation models, genomic language models, sequence embeddings, k-mers, tokenization, long-tail learning, class imbalance, metabarcoding
TL;DR: Benchmark COI/COX1 taxonomy using frozen gLM embeddings vs k-mers on eKOI and 5.6M-seq MetaCOXI; tokenization, pretraining domain, and imbalance losses shape long-tail Macro-F1.
Abstract: Generative AI in genomics increasingly relies on pretrained genomic foundation models (gLMs) as reusable sequence encoders, yet practical deployment faces persistent barriers: tokenization mismatch with biological signal, domain shift between pretraining corpora and target assays, and extreme long-tail label distributions that stress standard objectives. We study these challenges in the ecologically central COI/COX1 gene by benchmarking an alignment-free pipeline that converts nucleotide sequences into fixed-length embeddings (mean-pooled hidden states) and trains lightweight MLP classifiers that predict each taxonomic rank independently, from Domain to Species. We evaluate two complementary regimes that jointly expose frontiers for scalable genomic representation learning: eKOI (15,947 sequences; protist-rich; 11,047 species) and MetaCOXI (5.6M metazoan sequences; 743,671 species). Across diverse gLM families (autoregressive decoders and masked-language encoders) and explicit compositional baselines (overlapping k-mer frequencies up to k=6), we find that the effective motif length induced by tokenization is a dominant driver of fine-rank separability, while corpus alignment (eukaryote vs. prokaryote pretraining data) materially impacts transfer even under identical tokenization. Finally, imbalance-aware objectives (weighted cross-entropy and a hybrid weighted+contrastive loss) can stabilize performance on rare taxa, but their benefit remains representation-dependent.
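Illustrative sketch (not from the submission): the two sequence representations named in the abstract, overlapping k-mer frequencies and mean-pooled hidden states, can be written in a few lines of plain Python. The function names `kmer_frequencies` and `mean_pool` are hypothetical, and the normalization, the skipping of windows with ambiguous bases, and the masked average are assumptions, since the abstract does not specify these details.

```python
from itertools import product

ALPHABET = "ACGT"  # assumption: only unambiguous nucleotides are counted


def kmer_frequencies(seq, k):
    """Relative frequencies of overlapping k-mers, in lexicographic order."""
    index = {"".join(p): i for i, p in enumerate(product(ALPHABET, repeat=k))}
    counts = [0] * len(index)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        # Windows containing other symbols (e.g. N) are skipped (assumption).
        j = index.get(seq[i:i + k])
        if j is not None:
            counts[j] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)


def mean_pool(hidden_states, mask):
    """Average the token vectors whose mask entry is truthy.

    hidden_states: list of per-token vectors (lists of floats, equal length).
    mask: list of 0/1 flags marking real (non-padding) tokens.
    """
    dim = len(hidden_states[0])
    kept = [h for h, m in zip(hidden_states, mask) if m]
    if not kept:
        return [0.0] * dim
    return [sum(h[d] for h in kept) / len(kept) for d in range(dim)]
```

With k=6 the frequency vector has 4^6 = 4096 entries regardless of sequence length, which is what makes it usable as a fixed-length input to an MLP classifier, in the same way as a mean-pooled embedding.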
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 79