Abstract: In this study, we introduce a novel approach for undermining the alignment of large language models (LLMs), which we term the Module Attack. A module attack compromises a model's alignment by swapping intermediate modules inside the LLM, thereby altering its internal structure. Unlike traditional prompt-based jailbreak attacks, which rely on external inputs and have limited effectiveness, module attacks bypass alignment defense mechanisms by exploiting structural vulnerabilities inside the LLM and elicit restricted answers without any separate prompt engineering. A minimal sketch of such a swap is shown below.
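The abstract does not specify which modules are exchanged; the following sketch only illustrates the general idea of a module swap on a Hugging Face causal LM, hypothetically exchanging the MLP blocks of two decoder layers. Model name and layer indices are assumptions, not the paper's configuration.

```python
# Illustrative module-swap sketch (assumed details, not the paper's exact attack).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-7B-Instruct"  # one of the architecture families studied

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)

def swap_mlp_modules(model, layer_a: int, layer_b: int):
    """Exchange the MLP sub-modules of two decoder layers in place."""
    layers = model.model.layers  # decoder layer stack (Qwen/Llama/Mistral style)
    layers[layer_a].mlp, layers[layer_b].mlp = layers[layer_b].mlp, layers[layer_a].mlp
    return model

# Example: swap the MLP blocks of layers 10 and 20 (indices are illustrative only).
attacked_model = swap_mlp_modules(model, 10, 20)
```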
We also propose a cooperative decoding approach that alternates token generation between the attacked LLM and the original LLM (see the sketch below).
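A minimal sketch of cooperative decoding, assuming a simple greedy scheme that alternates single-token steps between the attacked and original models; the paper's actual scheduling and sampling strategy may differ.

```python
# Cooperative decoding sketch: alternate one greedy token per step between models.
import torch

@torch.no_grad()
def cooperative_decode(original, attacked, tokenizer, prompt, max_new_tokens=64):
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    for step in range(max_new_tokens):
        model = attacked if step % 2 == 0 else original  # alternate each step
        logits = model(ids).logits[:, -1, :]
        next_id = logits.argmax(dim=-1, keepdim=True)    # greedy token selection
        ids = torch.cat([ids, next_id], dim=-1)
        if next_id.item() == tokenizer.eos_token_id:
            break
    return tokenizer.decode(ids[0], skip_special_tokens=True)
```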
In conclusion, we achieved high attack success rates (ASRs), reaching 100% in most cases, across different LLM architectures (Qwen 2.5, Llama 3.1, Mistral v0.3), and found no difference in ASR between generation with the attacked LLM alone and cooperative decoding with the original LLM. We also showed that a simple swap of internal modules can break a model's alignment without any prompt engineering, making this a method that neutralizes alignment faster than existing approaches and requires no prior preparation.
This research provides a deeper understanding of the structural vulnerabilities of LLMs and confirms that manipulating internal modules can easily lead to unintended and harmful behavior.
Paper Type: Long
Research Area: Ethics, Bias, and Fairness
Research Area Keywords: Ethics, Bias, and Fairness, Interpretability and Analysis of Models for NLP
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 3164