Towards Understanding Hybrid Protein Language Model Design: A Systematic Ablation and Interpretability Study
Keywords: protein language models, mixture of experts, state space models, hybrid architectures, masked language modeling, architectural ablation, expert specialization, secondary structure prediction, protein representation learning
Abstract: Protein language models (PLMs) have emerged as powerful tools for learning sequence representations for diverse downstream prediction tasks and protein design. Recent PLMs incorporating mixture-of-experts (MoE) layers, state-space models (SSMs), and hybrid SSM-attention architectures have shown strong performance; however, the individual and joint contributions of these components remain poorly characterized. Here, we systematically evaluate how MoE, SSM, and hybrid architectures, both in isolation and in combination, affect PLM performance. For each component, we examine how key hyperparameters, including hybrid layer ratios and expert sparsity, influence performance. We find that hybrid architectures combining attention and SSM layers consistently outperform parameter-matched non-hybrid baselines. In contrast, incorporating MoE layers improves performance on some tasks while degrading it on others relative to dense baselines. For MoE models, we find that moderate expert sparsity best balances effective model capacity against routing stability. Mechanistic analysis of expert routing in MoE-hybrid models reveals emergent specialization aligned with protein secondary-structure characteristics. Together, these findings provide an empirically grounded framework for designing the next generation of high-capacity, compute-efficient PLMs.
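To make the notion of expert sparsity concrete, the sketch below shows a generic top-k routed MoE layer in PyTorch, where the top_k parameter sets how many experts process each token (smaller k means sparser routing and lower per-token compute). This is a minimal illustration of the general technique only; the class and parameter names (SparseMoE, n_experts, top_k) are assumptions for exposition and do not reflect the implementation evaluated in the paper.

```python
# Minimal sketch of token-level top-k expert routing (illustrative, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """Route each token to its top-k experts from a pool of feed-forward experts."""
    def __init__(self, d_model: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)  # gating network producing expert logits
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model),
                          nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) token representations
        logits = self.router(x)                         # (batch, seq_len, n_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)  # keep only the k best experts per token
        weights = F.softmax(weights, dim=-1)            # renormalize gates over the selected experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[..., slot] == e               # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[..., slot][mask].unsqueeze(-1) * expert(x[mask])
        return out
```

In this sketch, varying top_k relative to n_experts is one way to realize the "moderate expert sparsity" regime discussed in the abstract: very sparse routing limits effective capacity per token, while dense routing approaches a standard feed-forward layer and can destabilize the gating signal.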
Submission Number: 108