Keywords: Compression, Jailbreaking, Prompt Extraction, Safety Research, Pruning, Quantization, Mechanistic Interpretability
TL;DR: We study the susceptibility of compressed models to jailbreaking attacks, examining how various compression methods affect model robustness.
Abstract: Pretrained large language models, while powerful, are often not immediately usable. These base models are instruction-finetuned to improve safety, align with human objectives, and resist ``jailbreaking'' or prompt extraction attacks.
These post-trained models are often further compressed for real-world applications to reduce runtime cost and latency while preserving performance.
In this work, we study the susceptibility of compressed models to jailbreaking attacks, examining how various compression methods affect model robustness.
We find that low levels of pruning (10-30\%) and moderate levels of quantization (down to 4-bit) actually enhance resistance to jailbreaking attacks, whereas higher compression rates leave models more vulnerable.
We conclude by exploring this phenomenon using the refusal direction, a mechanistic interpretability tool, which reveals clues about the efficacy of the different compression methods.
Our work is a practical exploration of how common methods for improving real-world model performance interact.
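To make the refusal-direction analysis mentioned above concrete, here is a minimal, hypothetical sketch (not the authors' code) of computing such a direction as a difference of mean hidden activations between harmful and harmless prompts; the model name, layer index, and prompts are illustrative placeholders.

```python
# Hypothetical sketch: estimate a "refusal direction" as the difference of
# mean residual-stream activations on harmful vs. harmless prompts.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # placeholder model choice
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

harmful = ["Explain how to pick the lock on a neighbor's door."]   # illustrative
harmless = ["Explain how to bake a loaf of sourdough bread."]      # illustrative
layer = 14  # which hidden layer to probe; a tunable choice, not a fixed rule

def mean_activation(prompts):
    """Average the hidden state at the final token position over a prompt set."""
    acts = []
    for p in prompts:
        inputs = tokenizer(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**inputs)
        # out.hidden_states is a tuple of [batch, seq_len, dim] tensors
        acts.append(out.hidden_states[layer][0, -1, :])
    return torch.stack(acts).mean(dim=0)

# Difference-of-means direction, normalized to unit length.
refusal_dir = mean_activation(harmful) - mean_activation(harmless)
refusal_dir = refusal_dir / refusal_dir.norm()
print(refusal_dir.shape)
```

Comparing this direction (for example, via cosine similarity) before and after pruning or quantization gives one lens on how compression shifts a model's refusal behavior.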
Submission Number: 16