Keywords: guard models, LLM safety, AI alignment, semantic robustness, robustness to paraphrases
TL;DR: We probe guard models by paraphrasing LLM-generated answers and find that their safety scores shift with phrasing alone, exposing a critical gap between their design goal (semantic safety assessment of the answer) and their actual behavior (surface-level sensitivity).
Abstract: Guard models are increasingly used to evaluate the safety of large language model (LLM) outputs. These models are intended to assess the semantic content of responses, ensuring that outputs are judged based on meaning rather than superficial linguistic features. In this work, we reveal a critical failure mode: guard models often assign significantly different scores to semantically equivalent responses that differ only in phrasing. To systematically expose this fragility, we introduce a paraphrasing-based evaluation framework that generates meaning-preserving variants of LLM outputs and measures the variability in guard model scores. Our experiments show that even minor stylistic changes can lead to large fluctuations in scoring, indicating a reliance on spurious features rather than true semantic understanding. This behavior undermines the reliability of guard models in real-world applications. Our framework provides a model-agnostic diagnostic tool for assessing semantic robustness, offering a new lens through which to evaluate and improve the trustworthiness of LLM safety mechanisms.
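The evaluation framework described in the abstract can be summarized as: generate meaning-preserving paraphrases of a response, score each variant with the guard model, and measure how much the scores fluctuate. The sketch below illustrates this idea only; the paper does not specify a paraphraser, guard model, or aggregation statistics, so `paraphrase`, `guard_score`, and the reported summary metrics are assumptions, not the authors' implementation.

```python
# Minimal sketch of a paraphrasing-based robustness probe for guard models.
# `paraphrase` and `guard_score` are hypothetical callables supplied by the
# user (e.g., an LLM-based paraphraser and a guard model's safety scorer);
# neither is defined by the paper itself.

from statistics import mean, pstdev
from typing import Callable, List


def semantic_robustness_probe(
    response: str,
    paraphrase: Callable[[str, int], List[str]],  # returns n meaning-preserving variants
    guard_score: Callable[[str], float],          # returns a safety score, e.g. in [0, 1]
    n_variants: int = 8,
) -> dict:
    """Score a response and its paraphrases, then summarize score variability."""
    variants = [response] + paraphrase(response, n_variants)
    scores = [guard_score(v) for v in variants]
    return {
        "mean_score": mean(scores),
        "score_std": pstdev(scores),                # spread across paraphrases
        "score_range": max(scores) - min(scores),   # worst-case fluctuation
    }
```

A large `score_std` or `score_range` for semantically equivalent variants would indicate the surface-level sensitivity the abstract describes; a semantically robust guard model should keep these near zero.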
Submission Number: 115