Abstract: Large language models (LLMs) are widely deployed as zero-shot evaluators for answer grading, content moderation, and document ranking. Yet studies show that guard models (Guards)—LLMs fine-tuned for safety—remain vulnerable to "jailbreak" attacks, jeopardising downstream chatbots.
We confirm this weakness on three public benchmarks (BeaverTails, XSTest, AdvBench) and trace it to representation shifts that arise in the embedding layer and cascade through the Transformer stack.
To counteract the effect, we introduce Gamma-Guard: lightweight residual adapters inserted after the embeddings and at sparse intervals in the model. The adapters start with zero-scaled gates, so they retain the original behaviour; a brief adversarial fine-tuning phase then teaches them to denoise embeddings and refocus attention.
With fewer than 0.1% extra parameters and only a 2% latency increase, Gamma-Guard lifts adversarial accuracy from below 5% to 95%, a 90 percentage-point gain, while reducing clean-data accuracy by just 8 percentage points.
Extensive ablations further show that robustness improvements persist across different layer placements and model sizes.
To our knowledge, this is the first approach that directly augments large Guards with trainable adapters, providing a practical path toward safer large-scale LLM deployments.
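For illustration, a minimal sketch of the kind of zero-gated residual adapter the abstract describes, assuming a standard bottleneck design in PyTorch; the class name, dimensions, and placement shown here are hypothetical and not taken from the paper:

```python
import torch
import torch.nn as nn

class GatedResidualAdapter(nn.Module):
    """Residual adapter whose contribution is scaled by a gate initialised to
    zero, so the host model's behaviour is unchanged before fine-tuning."""

    def __init__(self, hidden_dim: int, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)  # project down
        self.up = nn.Linear(bottleneck_dim, hidden_dim)    # project back up
        self.act = nn.GELU()
        # Zero-scaled gate: the adapter contributes nothing at initialisation.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        update = self.up(self.act(self.down(hidden_states)))
        return hidden_states + self.gate * update

# Usage: insert after the embedding layer (and at sparse intervals in the stack).
adapter = GatedResidualAdapter(hidden_dim=4096)
x = torch.randn(2, 16, 4096)          # (batch, sequence, hidden)
assert torch.allclose(adapter(x), x)  # identity at initialisation
```

Because the gate starts at zero, only a brief adversarial fine-tuning phase is needed to learn a non-trivial correction while clean behaviour is largely preserved.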
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: Large Language Models, Guard Models, Adversarial Robustness, Residual Adapters
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 3143