Keywords: Adversarial Attacks, Large Language Models
TL;DR: We introduce a novel jailbreak framework specifically targeting private LLMs in black-box settings
Abstract: Large Language Models (LLMs) have greatly advanced the field of artificial general intelligence, yet their security vulnerabilities remain a pressing issue, particularly in fine-tuned models. Adversarial attacks in black-box settings, where model details and training data are obscured, are an emerging area of research and pose a substantial threat to the integrity of private models. In this work, we uncover a new attack vector: adversaries can exploit the similarities between open-source LLMs and fine-tuned private models to transfer adversarial examples. We introduce a novel attack strategy that generates adversarial examples on open-source models and fine-tunes them to target private, black-box models. Our experiments show that these attacks achieve success rates comparable to white-box attacks, even when private models have been trained on proprietary data. Furthermore, our approach demonstrates strong transferability to other models, including LLaMA3 and ChatGPT.
These findings highlight the urgent need for more robust defenses when fine-tuning open-source LLMs.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 1587