Local Fine-Tuning for Efficient Jailbreaking LLMs

ACL ARR 2025 February Submission4994 Authors

16 Feb 2025 (modified: 09 May 2025) · ACL ARR 2025 February Submission · CC BY 4.0
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across various natural language processing tasks. However, they remain vulnerable to adversarial inputs, known as jailbreak attacks, which are deliberately crafted to elicit harmful or undesirable responses. Among existing attack methods, optimization-based approaches achieve high success rates but are often impractical against black-box models. In this work, we focus on the common scenario in which private LLMs are fine-tuned from public LLMs, since fine-tuning large models is more feasible than training them from scratch in real-world applications. For this setting, we propose locally fine-tuning attacks that were optimized on open-source LLMs, effectively transforming a black-box attack into an easier white-box problem. This enables the application of existing optimization-based attack frameworks to nearly all LLMs. Our experiments show that these attacks achieve success rates comparable to white-box attacks, even when the private models have been trained on proprietary data. Furthermore, our approach demonstrates strong transferability to other models, including LLaMA3 and ChatGPT.
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: Jailbreak Attack, Large Language Models
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 4994