Gene-M1: Advancing Cross-Species Genomic Discovery via Taxon-Specific Mixture-of-Experts

Yuhang Li; Jiaqi Tang; Jianmin Chen; Yourui Han; Xuequn Shang; Bolin Chen

17 Sept 2025 (modified: 11 Feb 2026). Submitted to ICLR 2026. License: CC BY 4.0

Keywords: Cross-species genomics; Mixture-of-Experts (MoE); Taxon-specific modeling; Genomic Discovery

TL;DR: GENE-M1 introduces a taxonomy-aligned Mixture-of-Experts architecture for robust cross-species genomic representation learning.

Abstract: Prevailing genomic foundation models rely on a uniform architecture across all species, which overlooks evolutionary divergence and leads to feature interference and limited cross-species generalization. To address this, we introduce GENE-M1, a novel Mixture-of-Experts (MoE) framework strictly governed by biological taxonomy. Our method builds on three core components: (1) a hierarchical expert architecture that instantiates specialized modules for taxonomic ranks (Domain, Kingdom, Phylum, Class) to enable taxon-specific processing; (2) a dynamic router that activates expert pathways aligned with a sequence’s taxonomy, ensuring hierarchical feature extraction; and (3) a progressive training strategy that transfers knowledge from higher to lower taxonomic ranks for stable optimization. In addition, we construct GM-DATA, a large-scale, taxonomically aligned benchmark comprising 294 species spanning 5 Kingdoms, 18 Phyla, and 62 Classes, with broad and balanced coverage across major clades, as well as a held-out GM-DATA(eval) set of 15 unseen species for rigorous cross-species evaluation. Extensive experiments on this benchmark show that GENE-M1 significantly outperforms state-of-the-art baselines in few-shot classification and unsupervised clustering, demonstrating that explicit taxonomic alignment is key to robust and interpretable genomic representation learning. We will release our model, code, and dataset soon.
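
To make the routing idea in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes one toy expert per (rank, taxon) pair, a router that looks up experts from a sequence's lineage in the order Domain, Kingdom, Phylum, Class, and a staged schedule that unlocks ranks from highest to lowest. The names `Expert`, `TaxonomyRouter`, and `progressive_schedule`, the scale-and-shift expert, and the fallback of stopping at the deepest known rank for an unseen taxon are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field

# Rank order as listed in the abstract, highest rank first.
RANKS = ("domain", "kingdom", "phylum", "class")


@dataclass
class Expert:
    """Hypothetical stand-in for a taxon-specific module (scale and shift)."""
    rank: str
    taxon: str
    scale: float = 1.0
    shift: float = 0.0

    def __call__(self, features):
        return [self.scale * x + self.shift for x in features]


@dataclass
class TaxonomyRouter:
    """Hypothetical router: picks one expert per rank from the lineage."""
    experts: dict = field(default_factory=dict)  # (rank, taxon) -> Expert

    def register(self, expert):
        self.experts[(expert.rank, expert.taxon)] = expert

    def route(self, lineage):
        """Return the expert pathway, ordered from highest to lowest rank."""
        path = []
        for rank in RANKS:
            expert = self.experts.get((rank, lineage.get(rank)))
            if expert is None:
                break  # assumption: unseen taxon, stop at deepest known rank
            path.append(expert)
        return path

    def forward(self, features, lineage):
        for expert in self.route(lineage):
            features = expert(features)
        return features


def progressive_schedule():
    """Yield the ranks trainable at each stage (assumed: higher ranks first)."""
    for depth in range(1, len(RANKS) + 1):
        yield RANKS[:depth]


# Usage: only Domain and Kingdom experts exist, so routing stops after two ranks.
router = TaxonomyRouter()
router.register(Expert("domain", "Eukaryota", scale=2.0))
router.register(Expert("kingdom", "Animalia", shift=1.0))
lineage = {"domain": "Eukaryota", "kingdom": "Animalia",
           "phylum": "Chordata", "class": "Mammalia"}
path = router.route(lineage)
out = router.forward([1.0, 2.0], lineage)
stages = list(progressive_schedule())
```

In this sketch the pathway is fully determined by the taxonomy labels rather than learned gating scores, which is one possible reading of "strictly governed by biological taxonomy"; the abstract does not specify how the actual router is parameterized.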

Primary Area: other topics in machine learning (i.e., none of the above)

Submission Number: 8598
