Keywords: biosecurity, LLM safety, defense-in-depth, automated red teaming, preference alignment (DPO), runtime policy enforcement, STELLA agent orchestration
TL;DR: A lifecycle defense-in-depth toolkit within STELLA that hardens text-only LLMs for biosafety by combining data sanitization, DPO+LoRA alignment, calibrated guardrails, and continuous red teaming.
Abstract: Large language models (LLMs) are increasingly used for literature triage, drafting, and knowledge access in the life sciences, which creates dual-use risk when unsafe instructions or tacit know-how can be elicited. In response, this study operationalizes a defense-in-depth toolkit for biosecurity alignment that spans the full model lifecycle. The system is implemented as a Biosecurity Agent on STELLA and comprises four coordinated modes: dataset sanitization, preference alignment, runtime guardrails, and automated red teaming. For dataset sanitization (Mode 1), evaluation is conducted on CORD-19, the COVID-19 Open Research Dataset of coronavirus-related scholarly articles (Wang et al., 2020). Three keyword-strictness tiers are applied (L1, L2, L3): L1 is a compact, high-precision seed list for unambiguous risk indicators, tuned to minimize false positives; L2 is a human-curated list targeting domain-specific biosafety terms; and L3 is a comprehensive union of curated lists, representing the highest strictness. The removal rate increases monotonically with strictness (0.46% at L1, 20.87% at L2, and 70.40% at L3), illustrating the safety–utility trade-off. For preference alignment (Mode 2), DPO with LoRA adapters internalizes refusals and safe completions, reducing the end-to-end attack success rate (ASR) from 59.7% (95% CI 55.6–63.7) to 3.0% (95% CI 1.0–5.0). At inference (Mode 3), the runtime guard configured at L1/L2/L3 exhibits the expected security–usability trade-off: the L2 setting attains the best balance (F1 = 0.720, precision = 0.900, recall = 0.600, FPR = 0.067), while L3 minimizes jailbreak success at the cost of higher false positives. Under continuous automated red teaming (Mode 4), no successful jailbreaks are observed under the tested protocol. Taken together, the agent provides an auditable, lifecycle-aligned blueprint that measurably lowers attack success while preserving benign utility, and that supports principled operating-point selection under false-positive budgets for high-stakes deployments.
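The tiered sanitization in Mode 1 can be pictured as nested keyword lists applied as document filters. Below is a minimal Python sketch assuming a simple match-and-drop rule; the tier contents are illustrative placeholders, not the paper's curated lists, and only the nested L1 ⊆ L2 ⊆ L3 structure and the monotone removal behavior are taken from the abstract.

```python
import re
from typing import Iterable

# Placeholder tiers (NOT the paper's lists): L1 is a compact high-precision
# seed set, L2 adds human-curated biosafety terms, L3 is the broadest union.
L1 = {"select agent"}
L2 = L1 | {"aerosolization", "serial passaging"}
L3 = L2 | {"virulence factor", "reverse genetics"}

def sanitize(docs: Iterable[str], tier: set[str]) -> list[str]:
    """Drop any document that matches a tier keyword (case-insensitive)."""
    pattern = re.compile(
        "|".join(rf"\b{re.escape(k)}\b" for k in sorted(tier)),
        flags=re.IGNORECASE,
    )
    return [d for d in docs if not pattern.search(d)]

corpus = ["a benign epidemiology survey", "a protocol on serial passaging"]
print(len(sanitize(corpus, L1)), len(sanitize(corpus, L3)))  # 2, then 1
```

Stricter tiers can only remove more documents, which is the monotone removal behavior the abstract reports.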
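For Mode 2, a plausible implementation path is HuggingFace TRL's DPOTrainer combined with a PEFT LoRA adapter. The sketch below is an assumption about tooling (the abstract names DPO and LoRA but no library); the base model, hyperparameters, and preference pairs are all hypothetical, and TRL signatures vary across versions.

```python
# Hedged sketch: DPO preference alignment with LoRA adapters via TRL/PEFT.
# Base model, hyperparameters, and example pairs are illustrative only.
from datasets import Dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

BASE = "your-base-model"  # placeholder; the abstract does not name the model
model = AutoModelForCausalLM.from_pretrained(BASE)
tokenizer = AutoTokenizer.from_pretrained(BASE)

# Preference pairs: "chosen" is the refusal / safe completion to internalize,
# "rejected" is the unsafe continuation (contents here are schematic).
pairs = Dataset.from_dict({
    "prompt": ["<unsafe biosecurity request>"],
    "chosen": ["I can't help with that, but here is safe, public context ..."],
    "rejected": ["<unsafe completion>"],
})

peft_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                         task_type="CAUSAL_LM")
args = DPOConfig(output_dir="dpo-biosec-adapter", beta=0.1)  # beta: KL weight

trainer = DPOTrainer(model=model, args=args, train_dataset=pairs,
                     processing_class=tokenizer, peft_config=peft_config)
trainer.train()
```

Training only the LoRA adapter keeps the safety update small and auditable relative to full fine-tuning, which fits the lifecycle framing above.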
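The Mode 3 operating-point metrics can be sanity-checked directly from a confusion matrix. The counts below are hypothetical, chosen only because they reproduce the reported L2 numbers; the abstract does not state the evaluation set sizes.

```python
def guard_metrics(tp: int, fp: int, fn: int, tn: int) -> dict[str, float]:
    """Precision/recall/F1 over flagged prompts, plus false-positive rate."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    fpr = fp / (fp + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "fpr": fpr}

# Hypothetical counts consistent with the reported L2 operating point:
# precision 0.900, recall 0.600, F1 0.720, FPR ~0.067.
print(guard_metrics(tp=9, fp=1, fn=6, tn=14))
```

Sweeping such counts across the L1/L2/L3 guard settings is what makes operating-point selection under a false-positive budget concrete.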
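The ASR figures carry 95% confidence intervals; one common construction is the normal-approximation (Wald) interval sketched below. The abstract does not say which interval the paper uses, so this is an assumption, and the trial counts in the usage line are hypothetical.

```python
import math

def asr_with_ci(successes: int, trials: int, z: float = 1.96):
    """Attack success rate with a 95% normal-approximation interval."""
    p = successes / trials
    half = z * math.sqrt(p * (1 - p) / trials)
    return p, max(0.0, p - half), min(1.0, p + half)

# Hypothetical run: 17 successful attacks out of 567 attempts (~3.0% ASR).
print(asr_with_ci(17, 567))
```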
Submission Number: 42