Abstract: The rapid adoption of Large Language Models in user-facing applications has magnified security risks, as adversarial prompts continue to circumvent built-in safeguards with increasing sophistication. Current external safety classifiers predominantly rely on supervised fine-tuning—a computationally expensive approach that proves brittle against novel attacks and demands constant retraining cycles. We present FORTRESS, a Fast, Orchestrated Tuning-free Retrieval Ensemble for Scalable Safety that eliminates the need for costly, gradient-based fine-tuning. Our framework unifies semantic retrieval and dynamic perplexity analysis with a single instruction-tuned LLM, creating an efficient pipeline that adapts to emerging threats through simple data ingestion rather than model retraining. FORTRESS employs a novel dynamic ensemble strategy that intelligently weighs complementary signals: semantic similarity for known threat patterns and statistical anomaly detection for zero-day attacks. Extensive evaluation across nine safety benchmarks demonstrates that FORTRESS achieves state-of-the-art performance with an F1 score of 91.6\%, while operating over five times faster than leading fine-tuned classifiers. Its data-centric design enables rapid adaptation to new threats through simple data ingestion—a process we show improves performance without a latency trade-off—offering a practical, scalable, and robust approach to LLM safety.
Submission Length: Regular submission (no more than 12 pages of main content)
Supplementary Material:  zip
Assigned Action Editor: ~Huazheng_Wang1
Submission Number: 5263
Loading