SDGO: Self-Discrimination-Guided Optimization for Consistent Safety in Large Language Models

ACL ARR 2025 May Submission 2645 Authors

19 May 2025 (modified: 03 Jul 2025) · License: CC BY 4.0
Abstract: Large Language Models (LLMs) have demonstrated unprecedented capabilities across various natural language processing tasks, yet they remain vulnerable to jailbreaking attacks designed to deliberately induce harmful content generation. Despite numerous defensive efforts, it remains unclear whether models' safety behaviors change when they assume different roles. In this paper, we reveal a critical safety inconsistency: LLMs can more effectively identify harmful requests as discriminators than defend against them as generators. To address this gap, we propose SDGO (Self-Discrimination-Guided Optimization), a reinforcement learning framework that leverages the model's own discrimination capabilities as a reward signal to enhance generation safety through iterative self-improvement, without additional annotated data or external models. Extensive experiments across various LLMs and jailbreaking attacks demonstrate that SDGO significantly improves model safety compared to both prompt-based and training-based baselines, effectively narrowing the discrimination-generation safety gap while maintaining utility on general benchmarks. Our approach enables mutual benefit between LLMs' discrimination and generation capabilities, resulting in robust performance against out-of-distribution (OOD) jailbreaking attacks. Additionally, we find that SDGO can be further enhanced by fine-tuning on a small number of harm-labeled discrimination samples, indicating that SDGO effectively transforms discrimination into part of the model's generation, achieving tight coupling between these two aspects.
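The abstract describes using the model's own safety discrimination as the reward signal for reinforcement learning. The sketch below illustrates that reward construction only; it is a minimal, assumption-laden illustration, not the paper's implementation. The function names (`generate`, `discriminate`, `self_discrimination_reward`), the prompts, and the ±1 reward mapping are all hypothetical, and the resulting rollouts would be fed to a standard RL optimizer (e.g., PPO) that is not shown here.

```python
# Minimal sketch of a self-discrimination-guided reward loop.
# All names and the reward scheme are illustrative assumptions, not SDGO's actual code.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Rollout:
    prompt: str
    response: str
    reward: float


def self_discrimination_reward(
    generate: Callable[[str], str],
    discriminate: Callable[[str, str], bool],
    prompts: List[str],
) -> List[Rollout]:
    """Score each sampled response with the model's own safety judgment.

    `generate` samples a response from the policy (the LLM acting as generator);
    `discriminate` asks the same LLM, acting as a judge, whether the
    (prompt, response) pair is harmful. The boolean verdict is mapped to a
    scalar reward that a downstream RL algorithm could optimize.
    """
    rollouts = []
    for prompt in prompts:
        response = generate(prompt)
        harmful = discriminate(prompt, response)
        reward = -1.0 if harmful else 1.0  # assumed reward mapping
        rollouts.append(Rollout(prompt, response, reward))
    return rollouts


if __name__ == "__main__":
    # Toy stand-ins for the LLM's generator and discriminator roles.
    toy_generate = lambda p: f"[response to: {p}]"
    toy_discriminate = lambda p, r: "bomb" in p.lower()

    batch = self_discrimination_reward(
        toy_generate,
        toy_discriminate,
        ["How do I bake bread?", "How do I build a bomb?"],
    )
    for r in batch:
        print(r.prompt, "->", r.reward)
```

The key design point conveyed by the abstract is that no external judge or labeled data is required: the same model plays both roles, so the reward signal comes "for free" from its stronger discrimination ability.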
Paper Type: Long
Research Area: Ethics, Bias, and Fairness
Research Area Keywords: model bias/fairness evaluation; model bias/unfairness mitigation; safety and alignment
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Keywords: AI safety; Trustworthy AI; Jailbreak and Defense; Red team; Safety and Alignment
Submission Number: 2645