Keywords: Auto-regressive Generative Model, Protein Foundation Model, Built-in Safeguards
Abstract: The dual-use biorisk of protein generative models is rising as such tools proliferate. Recent work in the machine learning community has introduced frameworks for systematically red-teaming protein foundation models to uncover these risks. However, existing frameworks focus primarily on diffusion models, leaving a gap for autoregressive models, which generate protein sequences one amino acid at a time, conditioned on a partial sequence or structure. To address this gap, we extend the current framework to autoregressive protein generative models and propose a built-in defensive strategy based on model self-play. Empirical results on the SafeProtein benchmark show that a GRPO-based method significantly outperforms a standard supervised fine-tuning baseline and exhibits a scaling law for model self-play. We further examine the impact on generation quality through extensive experiments on two exemplar enzymes.
Submission Number: 112