Keywords: LLM; guardrail model
TL;DR: We propose an effective, robust, and flexible LLM guardrail model via knowledge-enhanced logical reasoning.
Abstract: As LLMs become increasingly prevalent across various applications, it is critical to establish safety guardrails to moderate the input/output content of LLMs and ensure compliance with safety policies. Existing guardrail models, such as OpenAI Mod and LlamaGuard, treat various safety categories (e.g., self-harm, self-harm/instructions) independently and fail to explicitly capture the intercorrelations among them. This leads to limitations such as ineffectiveness due to inadequate training on long-tail data from correlated safety categories, susceptibility to jailbreaking attacks, and inflexibility regarding new safety categories. To address these limitations, we propose $R^2$-Guard, a robust reasoning-enabled LLM guardrail via knowledge-enhanced logical reasoning. Specifically, $R^2$-Guard comprises two parts: data-driven guardrail models and a reasoning component. The data-driven guardrail models provide unsafety probabilities of the moderated content for different safety categories. We then encode safety knowledge among different categories as first-order logical rules and embed them into a probabilistic graphical model (PGM)-based reasoning component. The unsafety probabilities for different categories from the data-driven guardrail models are passed to the reasoning component for final inference. We employ two types of PGMs, Markov logic networks (MLNs) and probabilistic circuits (PCs), and optimize PCs to achieve a precision-efficiency balance via an improved graph structure. We also propose different methods to optimize the weights of the knowledge rules. To further stress-test guardrail models, we employ a pairwise construction method to build a new safety benchmark, TwinSafety, which features principled categories and presents new challenges for moderation. We show that $R^2$-Guard is effective even given unrepresentative categories or challenging jailbreaking prompts. We demonstrate the effectiveness of $R^2$-Guard by comparing it with eight strong guardrail models on six standard moderation datasets, and demonstrate its robustness against four SOTA jailbreaking attacks. $R^2$-Guard significantly surpasses the SOTA method LlamaGuard by 12.6% on standard moderation datasets and by 59.9% against jailbreaking attacks. We further reveal that $R^2$-Guard can effectively adapt to safety category updates by simply editing the PGM reasoning graph.
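To illustrate the knowledge-enhanced reasoning step described above, here is a minimal, hypothetical sketch (not the authors' implementation) of MLN-style inference: per-category unsafety probabilities from the data-driven guardrail models act as unary potentials, weighted first-order rules (e.g., self-harm/instructions implies self-harm, and each category implies unsafe) act as logical factors, and the final unsafety score is the marginal probability of an "unsafe" variable. The category names, rule weights, and probabilities below are illustrative assumptions.

```python
# Minimal sketch of MLN-style inference over two correlated safety categories
# and a final "unsafe" variable; all names and weights are illustrative.
from itertools import product
from math import exp

def mln_unsafe_prob(p_selfharm, p_selfharm_instr,
                    w_implies=3.0, w_category=4.0):
    """Marginal P(unsafe=1) in a tiny Markov logic network with rules:
    self-harm/instructions -> self-harm, and each category -> unsafe."""
    total, unsafe_mass = 0.0, 0.0
    for s1, s2, u in product([0, 1], repeat=3):  # s1: self-harm, s2: instructions, u: unsafe
        # Unary potentials from the data-driven guardrail models.
        w = (p_selfharm if s1 else 1 - p_selfharm) * \
            (p_selfharm_instr if s2 else 1 - p_selfharm_instr)
        # Weighted grounded rules: reward assignments that satisfy each implication.
        score = 0.0
        score += w_implies if (not s2 or s1) else 0.0   # s2 -> s1
        score += w_category if (not s1 or u) else 0.0   # s1 -> u
        score += w_category if (not s2 or u) else 0.0   # s2 -> u
        w *= exp(score)
        total += w
        unsafe_mass += w if u else 0.0
    return unsafe_mass / total

# Example: a borderline "instructions" score raises the final unsafety estimate
# above either per-category score alone, reflecting the encoded correlation.
print(round(mln_unsafe_prob(0.30, 0.55), 3))
```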
Submission Number: 147