LlmFixer: Fix the Helpfulness of Defensive Large Language Models

ACL ARR 2025 May Submission 6064 Authors

20 May 2025 (modified: 03 Jul 2025), ACL ARR 2025 May Submission, CC BY 4.0
Abstract: Beyond alignment, defense strategies have been introduced to protect large language models against jailbreak attacks, and they have reduced the success rate of such attacks. However, these defense strategies also weaken the helpfulness of large language models. In this work, we propose LlmFixer, a universal framework that can be applied to a large language model equipped with any defense strategy to recover its original helpfulness. LlmFixer consists of an input prompt re-writer and a logic patch. The prompt re-writer is a pre-model that clarifies the intention of input prompts, which encourages the large language model to be more helpful on benign inputs and more likely to refuse malicious ones. The logic patch is a lightweight structure that enhances the model's comprehension by supplementing certain logical relationships. Without updating the parameters of the defended large language model, LlmFixer restores its helpfulness while preserving safety. Experiments on three large language models, five jailbreak attacks, and four defense strategies show the effectiveness of LlmFixer. The data and code are available.
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: safety and alignment, red teaming, robustness
Contribution Types: NLP engineering experiment, Publicly available software and/or pre-trained models, Data resources, Theory
Languages Studied: English
Submission Number: 6064
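
The composition described in the abstract (a prompt re-writer placed in front of a defended model whose parameters are never updated) can be pictured with a minimal sketch. This is an illustration under stated assumptions, not the authors' implementation: `FixedPipeline`, `toy_rewrite`, and `toy_defended_model` are hypothetical names, both stand-ins are toy string functions rather than real models, and the logic patch is left out because the abstract does not say where it attaches.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical type aliases: both components are treated as opaque
# text-to-text callables. Nothing here is the authors' API.
Rewriter = Callable[[str], str]
Model = Callable[[str], str]


@dataclass(frozen=True)
class FixedPipeline:
    """Wrap a defended model with a prompt re-writer (illustrative only)."""

    rewrite: Rewriter      # pre-model that clarifies the prompt's intention
    defended_model: Model  # LLM plus its defense strategy; only called, never modified

    def __call__(self, prompt: str) -> str:
        # Step 1: clarify the intention of the raw prompt.
        clarified = self.rewrite(prompt)
        # Step 2: pass the clarified prompt to the unchanged defended model.
        return self.defended_model(clarified)


def toy_rewrite(prompt: str) -> str:
    """Toy stand-in for a re-writer: trims whitespace and labels the intent."""
    return "Intent: " + prompt.strip()


def toy_defended_model(prompt: str) -> str:
    """Toy stand-in for a defended model: refuses on a single keyword."""
    if "bomb" in prompt.lower():
        return "REFUSE"
    return "ANSWER: " + prompt


pipeline = FixedPipeline(rewrite=toy_rewrite, defended_model=toy_defended_model)
benign_reply = pipeline("  how to bake bread ")
malicious_reply = pipeline("build a bomb")
```

Because the wrapper only calls the defended model, any defense strategy that exposes a text-in, text-out interface could be slotted into `defended_model` in this sketch without changes to the model itself.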