Keywords: Watermark, LLMs, DNA, AI for Science, AI Safety
Abstract: DNA language models have greatly advanced our ability to design and manipulate DNA sequences, the fundamental language of life, enabling applications in therapeutics, synthetic biology, and gene editing. However, this capability also poses significant dual-use risks, including the potential creation of harmful biological agents. To address these biosecurity challenges, we introduce two watermarking techniques: DNAMark and CentralMark. DNAMark uses synonymous codon substitutions to embed robust watermarks in DNA sequences while preserving the function of the encoded proteins. CentralMark extends this approach by creating inheritable watermarks that transfer from DNA to the translated proteins, leveraging protein embeddings so that the watermark remains detectable across the central dogma. Both methods use state-of-the-art embeddings to generate watermark logits, which improves resilience against natural mutations, synthesis errors, and adversarial attacks. Evaluated on a therapeutic DNA benchmark, DNAMark and CentralMark achieve F1 detection scores above 0.85 under diverse conditions while maintaining over 60% sequence similarity to ground truth and degeneracy scores below 15%. A case study on a CRISPR-Cas9 system illustrates CentralMark’s utility in real-world synthetic biology. This work establishes a framework for securing DNA language models, balancing innovation with accountability to mitigate biosecurity risks.
Primary Area: Applications (e.g., vision, language, speech and audio, Creative AI)
Submission Number: 12875