Keywords: Protein Language Modeling, Mixture of Experts, Sparse Models, Soft Mixture of Experts, Expert Choice Routing, Bidirectional Encoders, Masked Language Modeling, Routing Stability
TL;DR: We bring Soft-MoE and Expert-Choice routing to bidirectional protein encoders for the first time, and find that Soft-MoE trains stably while Expert-Choice exhibits routing-driven instability whose severity depends on MoE layer placement.
Abstract: Scaling protein language models has predominantly relied on increasing dense transformer parameters, while loss-free MoE routing strategies naturally suited to bidirectional architectures, such as Soft-MoE and Expert-Choice, remain unexplored in protein encoders. We bring both routing strategies to protein encoders and systematically compare them under identical compute budgets. We find that Soft-MoE trains stably, while Expert-Choice exhibits cascading routing instability: it forces early termination when MoE layers are stacked consecutively and produces recurrent transient collapses when MoE layers are interleaved with dense layers. Despite training on only 39B tokens, roughly 1/25th the budget of ESM-2, our Soft-MoE variants match or exceed ESM-2 3B on Fluorescence and GB1 and ESM-2 650M on Stability, while trailing on Remote Homology, Secondary Structure, and ProteinGym by margins consistent with the token gap. Within Soft-MoE, ablating router-side $L_2$ normalization reveals that $L_2$ normalization functionally disables the dispatch step, yet the normalized model matches the unconstrained variant on downstream tasks, suggesting that dispatch contributes less to representation quality at this scale than the standard Soft-MoE framing predicts. Together, these results position continuous routing as a more robust foundation than discrete top-$k$ routing for MoE-based protein encoders.
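For readers unfamiliar with the two routing schemes, the following is a minimal NumPy sketch of the standard formulations, not the authors' implementation, which the abstract does not specify. All names (`soft_moe_weights`, `expert_choice_assignment`, `Phi`, `W`, `scale`, `capacity`) are hypothetical. The sketch assumes that "router-side $L_2$ normalization" means normalizing both the token inputs and the slot parameters before computing logits, as in the original Soft-MoE formulation. Under that assumption, with a scale of 1, every logit lies in $[-1, 1]$, so the dispatch weights within a slot cannot differ by more than a factor of $e^2$. This is one possible way dispatch could approach a uniform average over tokens; the abstract does not state the mechanism.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def l2_normalize(x, axis, eps=1e-6):
    # Scale vectors along `axis` to (approximately) unit L2 norm.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def soft_moe_weights(X, Phi, normalize=False, scale=1.0):
    """Soft-MoE routing weights (sketch).

    X:   (n_tokens, d) token representations.
    Phi: (d, n_slots) slot parameters (hypothetical name).
    Returns dispatch weights D and combine weights C, both (n_tokens, n_slots).
    """
    if normalize:
        # Assumed form of the L2 variant: normalize tokens and slot vectors.
        X = l2_normalize(X, axis=1)
        Phi = scale * l2_normalize(Phi, axis=0)
    logits = X @ Phi
    D = softmax(logits, axis=0)  # dispatch: each slot is a convex mix of tokens
    C = softmax(logits, axis=1)  # combine: each token is a convex mix of slots
    return D, C

def expert_choice_assignment(X, W, capacity):
    """Expert-Choice routing (sketch): each expert picks its top tokens.

    W: (d, n_experts) router weights (hypothetical name).
    Returns token indices of shape (capacity, n_experts).
    """
    scores = softmax(X @ W, axis=1)               # token-to-expert affinities
    return np.argsort(-scores, axis=0)[:capacity]  # discrete top-k per expert
```

The contrast relevant to the abstract is that Soft-MoE mixes every token into every slot with continuous weights, whereas Expert-Choice makes a hard selection, so small changes in router scores can change which tokens an expert sees.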
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 43