Built-in Safeguards for Auto-regressive Protein Foundation Models through Self-play

Xiaoyi Fu; Yang Xu; King L. CHOW; Yuan Yao

Published: 03 Mar 2026, Last Modified: 26 Apr 2026. Venue: ICLR 2026 Workshop FM4Science (Poster). License: CC BY 4.0

Keywords: Auto-regressive Generative Model, Protein Foundation Model, Built-in Safeguards

Abstract: Dual-use biorisk from protein generative models is rising as such tools proliferate. Recent work in the machine learning community has introduced frameworks for systematically red-teaming protein foundation models to uncover these risks. However, existing frameworks focus primarily on diffusion models, leaving a gap for autoregressive models, which generate protein sequences one amino acid at a time, conditioned on a partial sequence or structure. To address this gap, we extend the current framework to autoregressive protein generative models and propose a built-in defensive strategy based on model self-play. Empirical results on the SafeProtein benchmark show that a GRPO-based method significantly outperforms a standard supervised fine-tuning baseline and exhibits a scaling law for model self-play. We further examine the impact on generation quality through extensive experiments on two exemplar enzymes.

Submission Number: 112
