Abstract: Large language models (LLMs) have achieved groundbreaking progress in Natural Language Processing (NLP). Despite their numerous advantages, LLMs also pose significant safety risks. Self-evaluation mechanisms have gained increasing attention as a key safeguard to ensure safe and controllable content generation. However, LLMs often exhibit overconfidence, which seriously compromises the accuracy of safety self-evaluation. To address this challenge, we propose \textbf{SafeConf}, a method that enhances the safety self-evaluation capability of LLMs through confidence calibration. The method performs semantic mutations on the original safety evaluation questions and adopts a self-consistency strategy to quantify confidence based on answer accuracy on the mutated questions. Finally, these confidence scores are used to construct a dataset for fine-tuning. We conduct experiments on both Chinese and English datasets. The results show that SafeConf improves self-evaluation accuracy by an average of 5.86\% and 7.79\% over the state-of-the-art baseline methods on the Qwen2.5-7B-Instruct and Llama3-8B-Instruct models, respectively, without affecting the general capabilities of the models.
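To make the confidence-quantification step described in the abstract concrete, the sketch below mutates a safety evaluation question, measures how often the model still answers the mutated variants correctly, and packages the resulting score into a fine-tuning record. This is a minimal sketch under stated assumptions, not the authors' implementation: the `mutate` and `answer` callables, the agreement-based scoring, and the record format are all hypothetical stand-ins.

```python
from typing import Callable, List


def self_consistency_confidence(
    question: str,
    reference_label: str,                      # gold safety label of the original question (assumed available)
    mutate: Callable[[str, int], List[str]],   # hypothetical: returns k semantic mutations of the question
    answer: Callable[[str], str],              # hypothetical: the LLM's safety judgment for a single prompt
    k: int = 8,
) -> float:
    """Estimate confidence as the fraction of semantically mutated
    questions that the model still answers with the reference label."""
    variants = mutate(question, k)
    correct = sum(1 for v in variants if answer(v) == reference_label)
    return correct / max(len(variants), 1)


def build_finetune_example(question: str, reference_label: str, confidence: float) -> dict:
    """Pair the original question with its label and the calibrated
    confidence score, forming one record of the fine-tuning dataset."""
    return {
        "question": question,
        "label": reference_label,
        "confidence": round(confidence, 2),
    }
```

In this reading, low agreement on the mutated variants signals overconfidence on the original question, and fine-tuning on records that carry the calibrated score is what teaches the model to express it.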
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: safety evaluation, fine-tuning, security and privacy
Contribution Types: NLP engineering experiment
Languages Studied: Chinese, English
Submission Number: 3486