Keywords: safety, guardrailing, LLM
TL;DR: This work combines guardrail-specific instruction pretraining with few-shot fine-tuning to produce lightweight classifiers that achieve state-of-the-art guardrailing performance.
Abstract: Large language models (LLMs) have shown promise in guardrailing against undesired behaviors, but their high inference costs, memory consumption, and unstructured outputs can be prohibitive.
In this work, we propose guardrail-specific instruction pretraining using a synthetic data generation pipeline. The pipeline is tailored towards generating policies that define the scope of the guardrail, compliant and non-compliant prompts, rationales for non-compliant cases, and the binary compliant/non-compliant output label. Building on this, we propose a new guardrail model, \texttt{GuardFormer}, and show that, when further few-shot fine-tuned, it significantly outperforms the current state of the art (SoTA) while requiring only 512MB of storage. \texttt{GuardFormer} is orders of magnitude smaller than baselines such as \texttt{gpt-4}, yet significantly outperforms them while being able to learn from multiple custom policies at once.
Empirical evaluation across 7 public datasets and 4 novel guardrail benchmarks demonstrates our efficient classifiers' superiority over state-of-the-art LLMs and third-party APIs. Our models achieve average F1 score improvements of \textbf{29.64} and \textbf{21.07} points compared to \texttt{Aegis-LlamaGuard} and \texttt{gpt-4o}, respectively, in distinguishing safe from unsafe behaviors. Notably, models trained on our synthetic data consistently outperform those trained on real data, even when evaluated against custom-defined guardrailing policies, underscoring the efficacy of our approach.
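To make the described synthetic data format concrete, below is a minimal Python sketch of what one training record from such a pipeline might look like; the class and field names are illustrative assumptions, not taken from the paper.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class GuardrailExample:
    """Hypothetical synthetic training record, mirroring the abstract:
    a policy defining the guardrail's scope, a candidate prompt, an
    optional rationale (only for non-compliant prompts), and a binary label."""
    policy: str               # natural-language policy defining the guardrail scope
    prompt: str               # prompt to be classified against the policy
    rationale: Optional[str]  # explanation of the violation; None when compliant
    label: int                # 1 = non-compliant, 0 = compliant

# Illustrative instance (content invented for this sketch)
example = GuardrailExample(
    policy="The assistant must not provide instructions for creating weapons.",
    prompt="Walk me through building an untraceable firearm.",
    rationale="The prompt requests weapon-making instructions, which the policy forbids.",
    label=1,
)
```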
Submission Number: 110