Keywords: safety alignment, boundary guidance, reinforcement fine-tuning, classifier filtering, uncertainty
TL;DR: We show that LLMs trained to generate text that is easily classifiable as safe, rather than merely safe, achieve better safety and utility when paired with a safety filter.
Abstract: Generative models are increasingly paired with safety classifiers that filter harmful or undesirable outputs. A common strategy is to fine-tune the generator to reduce the probability of being filtered, but this can be suboptimal: it often pushes the model toward producing samples near the classifier’s decision boundary, increasing both false positives and false negatives. We propose *Boundary Guidance*, a reinforcement learning fine-tuning method that explicitly steers generation away from the classifier’s margin. On a benchmark of jailbreak and ambiguous prompts, *Boundary Guidance* improves both the safety and the utility of outputs, as judged by LLM-as-a-Judge evaluations. Comprehensive ablations across model scales and reward designs demonstrate the robustness of our approach.
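To make the core idea concrete: instead of rewarding any output the classifier labels safe, the reward penalizes outputs that fall inside a band around the classifier's decision boundary. Below is a minimal sketch of one such margin-avoiding reward, assuming a binary safety classifier that exposes P(safe); the function name `boundary_guidance_reward` and the margin width are illustrative assumptions, not the paper's actual reward design.

```python
import torch

def boundary_guidance_reward(p_safe: torch.Tensor, margin: float = 0.15) -> torch.Tensor:
    """Hypothetical margin-avoiding reward (not the paper's exact formulation).

    p_safe: classifier probability that each sample is safe, shape (batch,).
    Samples well inside the safe region (p_safe >= 0.5 + margin) get +1;
    samples in the band around the 0.5 decision boundary are penalized,
    pushing the generator away from ambiguous outputs; confidently
    unsafe samples receive the largest penalty.
    """
    dist = p_safe - 0.5  # signed distance to the decision boundary
    return torch.where(
        dist >= margin,
        torch.ones_like(dist),                    # confidently safe: +1.0
        torch.where(
            dist <= -margin,
            torch.full_like(dist, -1.0),          # confidently unsafe: -1.0
            torch.full_like(dist, -0.5),          # inside the margin band: -0.5
        ),
    )

# Example: classifier scores for a batch of generations
p = torch.tensor([0.95, 0.55, 0.48, 0.05])
print(boundary_guidance_reward(p))  # tensor([ 1.0000, -0.5000, -0.5000, -1.0000])
```

Under this reading, naive filter-avoidance training would happily accept the 0.55 sample, while the boundary-aware reward treats it nearly as badly as a clearly unsafe one, steering generation toward the confidently safe region.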
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 23992