From Supervision to Exploration: What Do Protein Language Models Learn During Reinforcement Learning?

15 Sept 2025 (modified: 12 Feb 2026) · ICLR 2026 Conference Desk Rejected Submission · CC BY 4.0
Keywords: Deep Learning, Protein Design, Reinforcement Learning, Protein Language Model
Abstract: Protein Language Models (PLMs) have achieved significant breakthroughs in computational protein science through pre-training on large-scale sequence databases and leveraging scalable network architectures. Concurrently, Reinforcement Learning (RL) has demonstrated substantial progress across multiple protein design tasks by enabling expanded exploration and precise multi-objective optimization. While RL has shown transformative potential in natural language processing by enabling models to discover emergent capabilities beyond their training distributions, its capacity to unlock latent functional patterns within protein sequence space remains underexplored. In this study, we investigate whether RL-enhanced PLMs can transcend their pre-training limitations and identify implicit sequence-structure-function relationships not explicitly encoded in foundational datasets. Through systematic evaluation across four critical protein design domains, antimicrobial peptide (AMP) design, kinase optimization, antibody engineering, and inverse folding, we employ diverse RL algorithms and model architectures to address this fundamental question. Our comprehensive analysis demonstrates that RL reliably improves sampling efficiency across domains and, more importantly, that its effectiveness is governed by a three-factor interaction: task difficulty, reward model accuracy, and policy capacity. Gains scale when rewards are accurate and informative, policies have sufficient capacity to realize the signal, and tasks present headroom beyond supervised learning; conversely, noisy rewards or capacity bottlenecks cap improvements despite exploration. This principled view offers practical guidance for RL in protein design: prioritize reward refinement before scaling policy size, match RL algorithms and regularization strength to task difficulty, and allocate capacity where marginal gains are largest.
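To make the reward-guided fine-tuning setting concrete, below is a minimal REINFORCE-style sketch of optimizing a toy sequence policy against a scalar reward with a KL penalty toward a frozen reference. All names here (ToyPolicy, toy_reward, the hyperparameters) are illustrative placeholders, not the paper's actual models, reward functions, or algorithms.

```python
# Minimal sketch: reward-guided fine-tuning of a toy sequence policy with
# REINFORCE and a KL penalty toward a frozen reference (a stand-in for a
# pre-trained PLM). Everything here is a simplified, hypothetical example.
import torch
import torch.nn as nn
from torch.distributions import Categorical

AMINO_ACIDS = 20
SEQ_LEN = 16

class ToyPolicy(nn.Module):
    """Tiny policy that emits independent per-position logits over residues."""
    def __init__(self):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(SEQ_LEN, AMINO_ACIDS))
    def forward(self):
        return self.logits  # shape: (SEQ_LEN, AMINO_ACIDS)

def toy_reward(seqs: torch.Tensor) -> torch.Tensor:
    """Hypothetical reward model: fraction of positions equal to residue 3,
    standing in for an activity, stability, or structure-based score."""
    return (seqs == 3).float().mean(dim=-1)

policy, reference = ToyPolicy(), ToyPolicy()
reference.load_state_dict(policy.state_dict())  # frozen "pre-trained" reference
for p in reference.parameters():
    p.requires_grad_(False)

opt = torch.optim.Adam(policy.parameters(), lr=1e-2)
kl_coef = 0.1  # regularization strength; in practice matched to task difficulty

for step in range(200):
    dist = Categorical(logits=policy())
    seqs = dist.sample((64,))                       # (batch, SEQ_LEN)
    logp = dist.log_prob(seqs).sum(-1)              # policy log-likelihood
    with torch.no_grad():
        ref_logp = Categorical(logits=reference()).log_prob(seqs).sum(-1)
    # Reward shaped with a KL-style penalty to stay near the reference model.
    reward = toy_reward(seqs) - kl_coef * (logp - ref_logp).detach()
    advantage = reward - reward.mean()              # simple mean baseline
    loss = -(advantage * logp).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```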
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Submission Number: 6306