Optimization and Tokenization Strategies for Biological Foundation Models: Evaluating H-Net and Muon

07 Sept 2025 (modified: 16 Oct 2025) · Submitted to NeurIPS 2025 2nd Workshop FM4LS · CC BY 4.0
Keywords: Biological foundation models, multimodal learning, H-Net, Muon optimizer, efficient training
TL;DR: Muon improves stability, convergence speed, and final perplexity; H-Net improves pre-training perplexity
Abstract: Biological foundation models are powerful tools for modeling DNA and protein sequences, but their performance depends heavily on tokenization strategies—from BPE and k-mers to single-nucleotide resolution—each imposing rigid inductive biases. While recent architectures like Hyena and Mamba2 achieve strong performance using single nucleotide/amino acid resolution, this reliance on fixed granularity may not align with biology's natural organization. H-Net, a recently proposed architecture that replaces static tokenization with dynamic chunking learned end-to-end through gradient descent, offers a solution by allowing models to discover meaningful boundaries directly from biological data. We extend H-Net to biological sequences, incorporating a Projected Gated Convolutional (PGC) routing module to capture local motifs, and show that on parameter-matched HG38 pretraining H-Net outperforms Mamba2 while achieving strong performance on supervised protein tasks. We further evaluate the Muon optimizer, which has not previously been applied to proteins. Muon consistently improves convergence speed and stability across architectures, including H-Net, Mamba2, and Transformer, delivering both faster training and better final perplexity. These results highlight the value of exploring both architectural innovations and optimization methods as the field moves toward multimodal biological foundation models that require flexible and efficient training.
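For readers unfamiliar with Muon, the sketch below illustrates the general shape of its update rule as publicly described (momentum accumulation followed by Newton-Schulz orthogonalization of the 2-D update); it is an illustrative assumption, not this submission's training code, and all function names, coefficients, and hyperparameters here are hypothetical defaults.

```python
# Minimal sketch of a Muon-style update for a 2-D weight matrix.
# Assumption: follows the publicly described algorithm (momentum + Newton-Schulz
# orthogonalization); not the authors' implementation or hyperparameters.
import numpy as np

def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    """Approximately orthogonalize G (push its singular values toward 1)."""
    a, b, c = 3.4445, -4.7750, 2.0315        # commonly cited quintic coefficients
    X = G / (np.linalg.norm(G) + eps)        # normalize so the iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T                               # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(W, grad, momentum, lr=0.02, beta=0.95):
    """One Muon-style step: orthogonalize the momentum buffer, then apply it."""
    momentum = beta * momentum + grad
    update = newton_schulz_orthogonalize(momentum)
    scale = max(1.0, W.shape[0] / W.shape[1]) ** 0.5   # shape-dependent scaling
    W = W - lr * scale * update
    return W, momentum

# Usage (illustrative): W, m = muon_step(W, dL_dW, m)
```

The orthogonalization step is what distinguishes Muon from plain SGD with momentum: it equalizes the scale of the update across directions of the weight matrix, which is the mechanism usually credited for the faster, more stable convergence reported in the abstract.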
Submission Number: 86