POROver: Improving Safety and Reducing Overrefusal in Large Language Models with Overgeneration and Preference Optimization
Keywords: LLM safety, LLM usefulness, Overrefusal in LLMs, responsible AI
TL;DR: This paper examines the impact of using superior language models as teachers on the safety-usefulness trade-off in student models, and explores the use of preference optimization methods to reduce overrefusal.
Abstract: Balancing safety and usefulness in large language models has become a critical challenge in recent years.
Models often exhibit unsafe behavior or adopt an overly cautious approach, leading to frequent overrefusal of benign prompts, which reduces their usefulness.
Addressing these issues requires methods that maintain safety while avoiding overrefusal.
In this work, we examine how overgenerating training data with advanced teacher models (e.g., GPT-4o), including responses to both general-purpose and toxic prompts, influences safety and usefulness in instruction-following language models.
Additionally, we present POROver, a strategy that applies preference optimization methods to reduce overrefusal by employing a superior teacher model's completions.
Our results show that overgenerating completions for general-purpose prompts significantly improves the balance between safety and usefulness. Specifically, the F1 score computed between safety and usefulness rises from 74.4\% to 91.8\%, driven by a substantial increase in safety. Moreover, overgeneration for toxic prompts raises usefulness from 11.1\% to 57.6\% while maintaining safety. Furthermore, preference optimization algorithms, applied with carefully curated preference data, can raise a model's usefulness from 57.6\% to 82.1\% while maintaining comparable safety levels.
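The preference-data curation described above can be illustrated with a minimal sketch: for benign prompts the student model refuses, pair the teacher's helpful completion (chosen) with the student's refusal (rejected), yielding records in the prompt/chosen/rejected format that preference optimization methods such as DPO consume. All names here (`build_preference_pairs`, `REFUSAL_MARKERS`, the heuristic refusal detector) are illustrative assumptions, not the paper's actual pipeline.

```python
# Illustrative sketch of curating preference pairs to reduce overrefusal.
# Assumption: a simple marker-based heuristic stands in for whatever
# refusal classifier the paper actually uses.

REFUSAL_MARKERS = ("i'm sorry", "i cannot", "i can't", "as an ai")


def is_refusal(completion: str) -> bool:
    """Heuristically flag a completion as a refusal."""
    text = completion.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def build_preference_pairs(prompts, student_completions, teacher_completions):
    """Keep only prompts where the student refuses but the teacher answers;
    the teacher's completion becomes 'chosen', the refusal 'rejected'."""
    pairs = []
    for prompt, student, teacher in zip(prompts, student_completions, teacher_completions):
        if is_refusal(student) and not is_refusal(teacher):
            pairs.append({"prompt": prompt, "chosen": teacher, "rejected": student})
    return pairs
```

The resulting list of dicts matches the common prompt/chosen/rejected schema expected by off-the-shelf DPO implementations; pairs where both models answer (or both refuse) are dropped, so the optimization signal targets overrefusal specifically.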
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 12792