Structural Persistence Despite Sequence Redaction: A Biosecurity Evaluation of Protein Language Models

Petr Simecek

Published: 15 Oct 2025, Last Modified: 24 Nov 2025 | BioSafe GenAI 2025 Poster | CC BY 4.0

Keywords: Biosecurity, Protein Language Models, Sequence Redaction, Structural Recovery, Protein Knots, ESM3

TL;DR: Even if most of a protein sequence is deleted, generative AI models can regenerate nearly identical 3D structures, showing that redaction is not a reliable biosecurity safeguard.

Abstract: Redacting or deleting portions of sensitive protein sequences is a common practice intended to reduce dual-use risks when publishing or sharing biological data. However, recent advances in generative protein models challenge the effectiveness of this approach. We demonstrate that even after up to 85% of a protein's sequence is masked, a state-of-the-art multimodal model (ESM3) can regenerate sequences that fold into nearly identical three-dimensional structures, preserving complex topological features. Using protein knots (structures that require precise spatial arrangements) as a case study, we show that partial sequence disclosure may not meaningfully reduce the ability of advanced models to reconstruct high-risk proteins. We further demonstrate that models can transform benign proteins into topologically complex variants through iterative modification, with a 31% success rate. These findings highlight critical vulnerabilities in current biosecurity practices and underscore the need for AI-native defenses that address structural recovery directly. We propose our masking-recovery evaluation framework as a benchmark for assessing biosecurity risks in generative biological AI systems.
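
To make the redaction step of a masking-recovery evaluation concrete, the following is a minimal sketch of how a fixed fraction of residues might be hidden and how a regenerated sequence could be scored against the original. It is not the paper's code: the function names, the `_` placeholder character, the random choice of positions, and the toy sequence are all assumptions made for illustration. The sketch does not call any generative model, and it does not cover the structural comparison (folding and knot detection) that the abstract treats as the main outcome.

```python
import random

MASK = "_"  # assumed placeholder for a redacted residue (hypothetical choice)


def mask_sequence(seq: str, mask_fraction: float = 0.85, seed: int = 0) -> str:
    """Replace a random subset of residues with MASK, keeping the length fixed."""
    rng = random.Random(seed)  # seeded so the same positions are masked each run
    n_mask = round(len(seq) * mask_fraction)
    masked_positions = set(rng.sample(range(len(seq)), n_mask))
    return "".join(
        MASK if i in masked_positions else residue
        for i, residue in enumerate(seq)
    )


def sequence_identity(a: str, b: str) -> float:
    """Fraction of positions at which two equal-length sequences agree."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Arbitrary 20-letter toy string, not a real protein of interest.
toy = "ACDEFGHIKLMNPQRSTVWY"
masked = mask_sequence(toy, mask_fraction=0.85)
print(masked)
print(sequence_identity(toy, masked))  # identity of the redacted string itself
```

Sequence identity is only a convenient first check here. The abstract's claim concerns structural recovery, so a regenerated sequence with low identity to the original could still count as a successful reconstruction if it folds into a nearly identical structure.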

Submission Number: 33
