Open-Source Can Be Dangerous: On the Vulnerability of Value Alignment in Open-Source LLMs

20 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: societal considerations including fairness, safety, privacy
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Large language model, Harmful, Alignment
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: Large language models (LLMs) possess immense capabilities but are at risk of malicious exploitation. To mitigate this risk, value alignment is employed to align LLMs with ethical standards. However, even after alignment, LLMs remain vulnerable to jailbreak attacks, although such attacks often face high rejection rates and yield only limited harmful output. In this paper, we introduce reverse alignment to highlight the vulnerability of value alignment in open-source LLMs. We show that, given access to model parameters, efficient attacks through fine-tuning LLMs become feasible. We investigate two types of reverse alignment techniques: reverse supervised fine-tuning (RSFT) and reverse value alignment (RVA). RSFT fine-tunes LLMs in a supervised manner to reverse their inherent values, and we also explore how to prepare the data required for RSFT. RVA optimizes LLMs to strengthen their preference for harmful content, reversing the models' value alignment. Our extensive experiments reveal that high-performance open-source LLMs can be adeptly reverse-aligned to output harmful content, even in the absence of manually curated malicious datasets. Our research serves as a whistleblower for the community, emphasizing the need for caution when open-sourcing LLMs. It also underscores the limitations of current alignment approaches and advocates the adoption of more advanced techniques.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 2540