FORTRESS: Fast, Tuning-Free Retrieval Ensemble for Scalable LLM Safety

TMLR Paper5263 Authors

02 Jul 2025 (modified: 05 Jul 2025) · Under review for TMLR · CC BY 4.0
Abstract: The rapid adoption of Large Language Models in user-facing applications has magnified security risks, as adversarial prompts circumvent built-in safeguards with increasing sophistication. Current external safety classifiers rely predominantly on supervised fine-tuning, a computationally expensive approach that proves brittle against novel attacks and demands constant retraining cycles. We present FORTRESS, a Fast, Orchestrated Tuning-free Retrieval Ensemble for Scalable Safety that eliminates the need for costly, gradient-based fine-tuning. Our framework unifies semantic retrieval and dynamic perplexity analysis with a single instruction-tuned LLM, creating an efficient pipeline that adapts to emerging threats through simple data ingestion rather than model retraining. FORTRESS employs a novel dynamic ensemble strategy that weighs two complementary signals: semantic similarity for known threat patterns and statistical anomaly detection for zero-day attacks. Extensive evaluation across nine safety benchmarks shows that FORTRESS achieves state-of-the-art performance with an F1 score of 91.6%, while operating over five times faster than leading fine-tuned classifiers. Because adaptation requires only data ingestion, a process we show improves performance without a latency trade-off, FORTRESS offers a practical, scalable, and robust approach to LLM safety.
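To make the dynamic ensemble idea concrete, here is a minimal sketch of how two complementary risk signals might be combined. The abstract does not specify the weighting rule, so the function names (`ensemble_risk`, `is_unsafe`), the choice of weighting the retrieval signal by its own similarity score, and the decision threshold are all illustrative assumptions, not the paper's actual method.

```python
def ensemble_risk(semantic_sim: float, ppl_anomaly: float) -> float:
    """Combine two risk signals, each assumed to lie in [0, 1].

    semantic_sim: similarity of the prompt to the nearest known attack
                  in the retrieval index.
    ppl_anomaly:  normalized perplexity-based anomaly score.
    """
    # Hypothetical dynamic weight: trust the retrieval signal more when a
    # near-duplicate of a known attack exists (high semantic_sim); otherwise
    # lean on the statistical anomaly signal, which can flag zero-day inputs.
    w = semantic_sim
    return w * semantic_sim + (1.0 - w) * ppl_anomaly


def is_unsafe(semantic_sim: float, ppl_anomaly: float,
              threshold: float = 0.5) -> bool:
    # Illustrative threshold; a real system would calibrate it on benchmarks.
    return ensemble_risk(semantic_sim, ppl_anomaly) >= threshold
```

Under this sketch, a prompt closely matching a known attack is flagged by the retrieval term, while a novel but statistically anomalous prompt is still caught by the perplexity term, reflecting the complementarity the abstract describes.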
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Huazheng_Wang1
Submission Number: 5263