Abstract: In this study, we introduce a novel approach for undermining the alignment of large language models (LLMs), which we term the Module Attack. A module attack compromises a model's alignment by swapping intermediate modules inside the LLM, thereby altering its internal structure. Unlike traditional prompt-based jailbreak attacks, which rely on external inputs and have limited effectiveness, module attacks bypass alignment defense mechanisms by exploiting structural vulnerabilities inside the LLM and elicit restricted answers without any separate prompt engineering. A minimal sketch of such a swap is shown below.
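The abstract does not specify which modules are exchanged; the following sketch only illustrates the general idea of a module swap on a Hugging Face causal LM, hypothetically exchanging the MLP blocks of two decoder layers. Model name and layer indices are assumptions, not the paper's configuration.

```python
# Illustrative module-swap sketch (assumed details, not the paper's exact attack).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-7B-Instruct"  # one of the architecture families studied

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)

def swap_mlp_modules(model, layer_a: int, layer_b: int):
    """Exchange the MLP sub-modules of two decoder layers in place."""
    layers = model.model.layers  # decoder layer stack (Qwen/Llama/Mistral style)
    layers[layer_a].mlp, layers[layer_b].mlp = layers[layer_b].mlp, layers[layer_a].mlp
    return model

# Example: swap the MLP blocks of layers 10 and 20 (indices are illustrative only).
attacked_model = swap_mlp_modules(model, 10, 20)
```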
We also propose a cooperative decoding approach that alternates token generation between the attacked LLM and the original LLM (see the sketch below).
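A minimal sketch of cooperative decoding, assuming a simple greedy scheme that alternates single-token steps between the attacked and original models; the paper's actual scheduling and sampling strategy may differ.

```python
# Cooperative decoding sketch: alternate one greedy token per step between models.
import torch

@torch.no_grad()
def cooperative_decode(original, attacked, tokenizer, prompt, max_new_tokens=64):
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    for step in range(max_new_tokens):
        model = attacked if step % 2 == 0 else original  # alternate each step
        logits = model(ids).logits[:, -1, :]
        next_id = logits.argmax(dim=-1, keepdim=True)    # greedy token selection
        ids = torch.cat([ids, next_id], dim=-1)
        if next_id.item() == tokenizer.eos_token_id:
            break
    return tokenizer.decode(ids[0], skip_special_tokens=True)
```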
In conclusion, we achieved high attack success rates (ASRs), reaching 100% in most cases, across different LLM architectures (Qwen 2.5, Llama 3.1, Mistral v0.3), and found no difference in ASR between generation with the attacked LLM alone and cooperative decoding with the original LLM. We also showed that a simple swap of internal modules can break a model's alignment without any prompt engineering, making this a method that neutralizes alignment faster than existing approaches and requires no prior preparation.
This research provides a deeper understanding of the structural vulnerabilities of LLMs and confirms that manipulating internal modules can easily lead to unintended and harmful behavior.
Paper Type: Long
Research Area: Ethics, Bias, and Fairness
Research Area Keywords: Ethics, Bias, and Fairness, Interpretability and Analysis of Models for NLP
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 3164