Keywords: guard models, safety guardrails, superficial alignment, semantic robustness, large language models (LLMs)
TL;DR: We diagnose guard models' sensitivity to harmless, non-semantic rewording and propose a self-supervised, parameter-efficient training method that enforces paraphrase consistency, cutting label flips by up to 60% without hurting accuracy.
Abstract: Guard models are critical for ensuring the safety of large language model (LLM) outputs, yet they remain vulnerable to superficial linguistic variation. We show that semantically equivalent paraphrases can cause large fluctuations in guard model safety scores, revealing a lack of semantic grounding. To address this, we introduce a two-stage framework: (1) a paraphrasing-based evaluation protocol that quantifies semantic robustness, and (2) a robust training strategy that enforces paraphrase consistency through self-supervised regularization. Our method constructs paraphrase sets for each response, computes a conservative set-level target probability via a skew-aware estimate, and applies parameter-efficient fine-tuning to align guard model predictions across these variants. This approach reduces rewording-induced variability and even improves benchmark accuracy, whereas naive targets such as the mean or median can degrade accuracy in our ablations. These results motivate treating semantic robustness as a first-class objective and offer a practical, parameter-efficient recipe for guard models that prioritize meaning over surface form.
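A minimal sketch of the idea described above, not the authors' exact formulation: given a guard model's unsafe-probability scores for a response and its paraphrases, compute a conservative set-level target and regularize every paraphrase's prediction toward it. The particular skew-aware rule (lean toward the riskier side of the median when the score distribution is skewed), the function names, and the example scores below are illustrative assumptions; in actual training the gradients would flow into parameter-efficient (e.g., LoRA) adapter weights.

```python
# Illustrative sketch only (assumed formulation, not the paper's exact estimator):
# build a conservative, skew-aware set-level target from a paraphrase set's scores
# and penalize each paraphrase's prediction for deviating from that shared target.

import torch


def skew_aware_target(probs: torch.Tensor) -> torch.Tensor:
    """Conservative set-level target for one paraphrase set (1-D tensor of unsafe probabilities)."""
    mean = probs.mean()
    median = probs.median()
    std = probs.std(unbiased=False).clamp_min(1e-8)
    skew = ((probs - mean) ** 3).mean() / std ** 3  # sample skewness
    # Assumed rule: with a heavy upper (unsafe) tail, take the larger of mean/median
    # so the target errs on the conservative side; otherwise use the median.
    return torch.where(skew > 0, torch.maximum(mean, median), median)


def paraphrase_consistency_loss(probs: torch.Tensor) -> torch.Tensor:
    """Cross-entropy of each paraphrase's score against the shared set-level target."""
    target = skew_aware_target(probs).detach().expand_as(probs)
    return torch.nn.functional.binary_cross_entropy(probs, target)


# Example: guard-model scores for five paraphrases of one response (hypothetical values).
scores = torch.tensor([0.12, 0.18, 0.15, 0.62, 0.20], requires_grad=True)
loss = paraphrase_consistency_loss(scores)
loss.backward()  # in fine-tuning, this gradient would update the PEFT adapter parameters
```

The consistency term would typically be added to the standard safety-classification loss, so predictions are pulled together across paraphrases without overriding the original labels.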
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 19289