RAG Makes Guardrails Unsafe? Investigating Robustness of Guardrails under RAG-style Contexts

ICLR 2026 Conference Submission 21537 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Guardrail, LLM safety, RAG
TL;DR: We conduct a systematic evaluation of the robustness of LLM-based guardrails against RAG-style context perturbations.
Abstract: With the increasing adoption of large language models (LLMs), ensuring the safety of LLM systems has become a pressing concern. External LLM-based guardrail models have emerged as a popular solution for screening unsafe inputs and outputs, but they are themselves fine-tuned or prompt-engineered LLMs and are therefore vulnerable to data distribution shifts. In this paper, taking Retrieval-Augmented Generation (RAG) as a case study, we investigate how robust LLM-based guardrails are to additional information embedded in the context. Through a systematic evaluation of three Llama Guard models and two GPT-oss models, we confirm that **inserting benign documents into the guardrail context alters the judgments of input and output guardrails in around 11\% and 8\% of cases**, making them unreliable. We separately analyze the effect of each component of the augmented context: the retrieved documents, the user query, and the LLM-generated response. The two mitigation methods we test bring only minor improvements. These results expose a context-robustness gap in current guardrails and motivate training and evaluation protocols that are robust to retrieval and query composition.
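
The evaluation protocol described in the abstract can be illustrated with a minimal sketch: compare a guardrail's verdict on the bare user query against its verdict on a RAG-style input that prepends benign retrieved documents, and count how often the verdict flips. The sketch below is not the authors' code; `classify_with_guardrail` is a placeholder for any input guardrail (e.g., a Llama Guard call), and the prompt formats are illustrative assumptions.

```python
# Minimal sketch: measuring how often a guardrail's verdict flips when
# benign retrieved documents are prepended to the user query.
from typing import Callable, List, Tuple

def build_plain_input(query: str) -> str:
    """Guardrail input containing only the user query."""
    return f"User: {query}"

def build_rag_input(query: str, docs: List[str]) -> str:
    """Guardrail input with RAG-style context: retrieved docs + user query."""
    context = "\n\n".join(f"[Document {i+1}]\n{d}" for i, d in enumerate(docs))
    return f"Retrieved context:\n{context}\n\nUser: {query}"

def flip_rate(
    cases: List[Tuple[str, List[str]]],
    classify_with_guardrail: Callable[[str], str],  # returns "safe" or "unsafe"
) -> float:
    """Fraction of cases where adding benign documents changes the verdict."""
    flips = 0
    for query, docs in cases:
        plain_verdict = classify_with_guardrail(build_plain_input(query))
        rag_verdict = classify_with_guardrail(build_rag_input(query, docs))
        flips += int(plain_verdict != rag_verdict)
    return flips / len(cases) if cases else 0.0

if __name__ == "__main__":
    # Dummy guardrail (keyword match) standing in for a real LLM-based guardrail.
    dummy_guardrail = lambda text: "unsafe" if "explosive" in text.lower() else "safe"
    cases = [
        ("How do fireworks work?", ["Fireworks use small explosive charges..."]),
        ("What is the capital of France?", ["Paris is the capital of France."]),
    ]
    print(f"Judgment flip rate: {flip_rate(cases, dummy_guardrail):.0%}")
```

In the paper's setting, the placeholder classifier would be replaced by an actual guardrail model, and the same comparison would be run separately for input guardrails (query only) and output guardrails (query plus LLM-generated response).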
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 21537