A Biosecurity Agent for Lifecycle LLM Biosecurity Alignment

ICLR 2026 Conference Submission 21654 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: biosecurity, large language models, safety alignment, guardrails, red teaming, DPO, LoRA, mixture-of-experts, dataset sanitization, dual-use risk
TL;DR: Defense-in-depth Biosecurity Agent for text-only LLMs: dataset sanitization, DPO+LoRA alignment, multi-signal guardrails with an FPR budget, and automated red teaming; standardized protocol across open-weight models (8B–72B).
Abstract: Large language models are increasingly integrated into biomedical research workflows, from literature triage and hypothesis generation to experimental design. A Biosecurity Agent is operationalized as a defense-in-depth framework spanning the model lifecycle with four coordinated modes: dataset sanitization (Mode 1), preference alignment via DPO+LoRA (Mode 2), runtime guardrails (Mode 3), and automated red teaming (Mode 4). On CORD-19, tiered filtering yields a monotonic removal curve of 0.46% (L1), 20.9% (L2), and 70.4% (L3), illustrating the safety-utility trade-off. Real alignment on Llama-3-8B reduces end-to-end attack success from 59.7% to 3.0% (meeting the ≤5% target); larger models assessed under simulated alignment maintain single-digit residual rates. At inference, the guard calibrated on a balanced 60-prompt set attains F1 = 0.694 at L2 (precision 0.895, recall 0.567, false-positive rate 0.067). Under continuous automated red teaming, the aligned 8B model records no successful jailbreaks under the tested protocol; for larger models, replay under the L2 guard preserves single-digit JSR with low FPR. Taken together, the agent provides an auditable, lifecycle-aligned approach that scales from 8B to ~70B parameters, substantially reducing attack success while preserving benign utility for biology-facing LLM assistance.
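As a sanity check on the reported guard metrics, the sketch below (not the authors' code) reconstructs a confusion matrix on the balanced 60-prompt calibration set (assumed to be 30 harmful and 30 benign prompts) that is consistent with the stated precision, recall, false-positive rate, and F1 at L2. The counts TP=17, FP=2, FN=13, TN=28 are inferred from those figures, not taken from the paper.

```python
# Hypothetical confusion-matrix counts consistent with the reported L2 metrics
# on a balanced 60-prompt set (30 harmful, 30 benign).
tp, fp = 17, 2    # harmful prompts blocked; benign prompts wrongly blocked
fn, tn = 13, 28   # harmful prompts missed; benign prompts correctly allowed

precision = tp / (tp + fp)                          # 17/19 ≈ 0.895
recall = tp / (tp + fn)                             # 17/30 ≈ 0.567
fpr = fp / (fp + tn)                                # 2/30  ≈ 0.067
f1 = 2 * precision * recall / (precision + recall)  # ≈ 0.694

print(round(precision, 3), round(recall, 3), round(fpr, 3), round(f1, 3))
# → 0.895 0.567 0.067 0.694
```

The four reported numbers are mutually consistent under exactly this split, which also makes concrete the stated trade-off: the L2 guard is precise (few benign prompts blocked, FPR within the stated budget) at the cost of moderate recall on harmful prompts.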
Supplementary Material: zip
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Submission Number: 21654