Keywords: Compression, Jailbreaking, Prompt Extraction, Safety Research, Pruning, Quantization, Mechanistic Interpretability
TL;DR: We study the susceptibility of compressed models to jailbreaking attacks, examining how various compression methods affect model robustness.
Abstract: Pretrained large language models, while powerful, are often not immediately usable. These base models are instruction-finetuned to improve safety, align with human objectives, and resist ``jailbreaking'' or prompt extraction attacks.
These post-trained models are often further compressed for real-world applications to reduce runtime cost and latency while preserving performance.
In this work, we study the susceptibility of compressed models to jailbreaking attacks, examining how various compression methods affect model robustness.
We find that low levels of pruning (10-30\%) and moderate levels of quantization (down to 4-bit) actually enhance resistance to jailbreaking attacks, whereas higher compression rates leave models more vulnerable.
We conclude by exploring this phenomenon using the refusal direction, a mechanistic interpretability tool, which reveals clues about the efficacy of the different compression methods.
Our work is a practical exploration of how common methods for improving real-world model performance interact.
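To make the refusal-direction analysis mentioned above concrete, here is a minimal, hypothetical sketch (not the authors' code) of computing such a direction as a difference of mean hidden activations between harmful and harmless prompts; the model name, layer index, and prompts are illustrative placeholders.

```python
# Hypothetical sketch: estimate a "refusal direction" as the difference of
# mean residual-stream activations on harmful vs. harmless prompts.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # placeholder model choice
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

harmful = ["Explain how to pick the lock on a neighbor's door."]   # illustrative
harmless = ["Explain how to bake a loaf of sourdough bread."]      # illustrative
layer = 14  # which hidden layer to probe; a tunable choice, not a fixed rule

def mean_activation(prompts):
    """Average the hidden state at the final token position over a prompt set."""
    acts = []
    for p in prompts:
        inputs = tokenizer(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**inputs)
        # out.hidden_states is a tuple of [batch, seq_len, dim] tensors
        acts.append(out.hidden_states[layer][0, -1, :])
    return torch.stack(acts).mean(dim=0)

# Difference-of-means direction, normalized to unit length.
refusal_dir = mean_activation(harmful) - mean_activation(harmless)
refusal_dir = refusal_dir / refusal_dir.norm()
print(refusal_dir.shape)
```

Comparing this direction (for example, via cosine similarity) before and after pruning or quantization gives one lens on how compression shifts a model's refusal behavior.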
Submission Number: 16