Adaptive LLM Safety via Inference-Time System-Level Optimization

23 Feb 2026 (modified: 03 May 2026) · Rejected by TMLR · CC BY 4.0
Abstract: Large Language Models (LLMs) face rapidly evolving security threats, ranging from adversarial attacks such as jailbreaking to the leakage of sensitive information that should have been unlearned. Existing defense mechanisms are often static and require extensive model retraining, making them slow to adapt to new threats. We investigate whether adaptive, inference-time system designs can mitigate the limitations of static LLM defenses. We study a modular inference-time defense system (which we refer to as AegisLLM) that routes queries through a workflow of specialized modules whose defensive policies can be optimized with a remarkably small number of examples, achieving strong performance on multiple, distinct security challenges. We demonstrate the effectiveness of this system on two critical threats: sensitive information disclosure (with unlearning as the defense) and jailbreaking. On the WMDP benchmark, it approaches the random-guess lower bound for unlearning with only 20 training examples. For jailbreaking, it improves defense performance by ~51% over the base model on the StrongReject benchmark, while maintaining high utility, as measured by a false refusal rate of only 7.9% on the PHTest benchmark. Furthermore, we show that prompts optimized on one benchmark generalize effectively to others, underscoring the robustness of this approach. Our work highlights the significant advantages of adaptive, system-level security and demonstrates the power of prompt optimization for creating scalable and efficient LLM safety solutions. We provide our code at: https://anonymous.4open.science/r/aegisllm-11B0
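To make the system-level idea in the abstract concrete, below is a minimal Python sketch of such a modular inference-time pipeline: a guard module, whose policy prompt would be tuned offline from a handful of labeled examples, gates a responder module before any answer is produced. All names (Module, defended_answer, the toy model) are illustrative assumptions for this sketch, not the paper's actual implementation or API.

```python
# Minimal sketch of an inference-time, modular defense pipeline.
# Hypothetical names throughout; any chat/completion endpoint could
# stand in for the toy model used here.
from dataclasses import dataclass
from typing import Callable

LLM = Callable[[str], str]  # stand-in type for a text-in, text-out model


@dataclass
class Module:
    """One specialized defense module: a policy prompt wrapped around an LLM.

    The policy_prompt is the part that would be optimized from a small
    number of labeled examples, rather than by retraining the model.
    """
    name: str
    policy_prompt: str

    def __call__(self, llm: LLM, text: str) -> str:
        return llm(f"{self.policy_prompt}\n\nInput: {text}")


def defended_answer(llm: LLM, guard: Module, responder: Module, query: str) -> str:
    """Route the query through the guard first; refuse on unsafe verdicts."""
    verdict = guard(llm, query).strip().upper()
    if verdict.startswith("UNSAFE"):
        return "I can't help with that."
    return responder(llm, query)


if __name__ == "__main__":
    def toy_llm(prompt: str) -> str:
        # Placeholder model so the demo runs without an API key.
        return "UNSAFE" if "synthesize" in prompt.lower() else "Here is an answer."

    guard = Module(
        name="guard",
        policy_prompt="Label the input UNSAFE if it seeks harmful or unlearned content, else SAFE.",
    )
    responder = Module(name="responder", policy_prompt="Answer the input helpfully.")

    print(defended_answer(toy_llm, guard, responder, "How do I synthesize a toxin?"))
```

Because the defensive behavior lives entirely in the modules' policy prompts, adapting to a new threat under this design amounts to re-optimizing a prompt over a few examples rather than retraining the underlying model.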
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Yang_Zhang15
Submission Number: 7651