Abstract: Large language models (LLMs) are widely deployed as zero-shot evaluators for answer grading, content moderation, and document ranking. Yet studies show that guard models (Guards)—LLMs fine-tuned for safety—remain vulnerable to "jailbreak" attacks, jeopardising downstream chatbots.
We confirm this weakness on three public benchmarks (BeaverTails, XSTest, AdvBench) and trace it to representation shifts that arise in the embedding layer and cascade through the Transformer stack.
To counteract the effect, we introduce Gamma-Guard: lightweight residual adapters inserted after the embeddings and at sparse intervals in the model. The adapters start with zero-scaled gates, so they retain the original behaviour; a brief adversarial fine-tuning phase then teaches them to denoise embeddings and refocus attention.
With fewer than 0.1% extra parameters and only a 2% latency increase, Gamma-Guard lifts adversarial accuracy from below 5% to 95%, a 90 percentage-point gain, while reducing clean-data accuracy by just 8 percentage points.
Extensive ablations further show that robustness improvements persist across different layer placements and model sizes.
To our knowledge, this is the first approach that directly augments large Guards with trainable adapters, providing a practical path toward safer large-scale LLM deployments.
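For illustration, a minimal sketch of the kind of zero-gated residual adapter the abstract describes, assuming a standard bottleneck design in PyTorch; the class name, dimensions, and placement shown here are hypothetical and not taken from the paper:

```python
import torch
import torch.nn as nn

class GatedResidualAdapter(nn.Module):
    """Residual adapter whose contribution is scaled by a gate initialised to
    zero, so the host model's behaviour is unchanged before fine-tuning."""

    def __init__(self, hidden_dim: int, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)  # project down
        self.up = nn.Linear(bottleneck_dim, hidden_dim)    # project back up
        self.act = nn.GELU()
        # Zero-scaled gate: the adapter contributes nothing at initialisation.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        update = self.up(self.act(self.down(hidden_states)))
        return hidden_states + self.gate * update

# Usage: insert after the embedding layer (and at sparse intervals in the stack).
adapter = GatedResidualAdapter(hidden_dim=4096)
x = torch.randn(2, 16, 4096)          # (batch, sequence, hidden)
assert torch.allclose(adapter(x), x)  # identity at initialisation
```

Because the gate starts at zero, only a brief adversarial fine-tuning phase is needed to learn a non-trivial correction while clean behaviour is largely preserved.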
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: Large Language Models, Guard Models, Adversarial Robustness, Residual Adapters
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 3143