Towards Quantization-Adversarial Reparameterizations

Published: 27 Oct 2025, Last Modified: 27 Oct 2025. NeurIPS Lock-LLM Workshop 2025 Poster. License: CC BY 4.0
Keywords: quantization, unlearning, safety, security
TL;DR: We introduce a set of weight-only, training-free transforms that induce catastrophic failure of post-training quantization while preserving performance at full precision.
Abstract: Post-training quantization (PTQ) of large language models is now routine for latency and cost, but it also enables third parties to convert and redeploy models outside their intended precision regime, posing risks such as the undoing of unlearning and the failure of safety guardrails. We attempt to create a reparameterized ("encrypted") model that behaves normally at high precision yet fails in a controlled, safe manner once standard PTQ is applied. We present a set of training-free, weight-only transforms that leave full-precision behavior largely unchanged while being adversarial to PTQ. Concretely: (A) $\textbf{ill-conditioning for error amplification}$ that is numerically tame at BF16 but magnifies fixed-point rounding/clipping; (B) $\textbf{fragile residual encodings}$ that cancel at high precision but reappear as structured biases after rounding; and (C) $\textbf{dynamic scaling traps}$ that provoke clipping or pathological rescaling under PTQ. We observe strong semantic preservation between original and encrypted BF16 models and catastrophic collapse after PTQ on an easy arithmetic benchmark, while original PTQ baselines remain healthy. Our methods require no training, finetuning, extra layers, custom ops, or runtime changes; the reparameterizations can be applied within a few minutes on CPU.
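The abstract does not spell out the exact transforms, but the flavor of (A) can be illustrated with an inverse-scaling trick on a pair of weights that compose linearly. The sketch below is a minimal illustration under our own assumptions, not the paper's implementation; the function `ill_condition_pair`, the `kappa` parameter, and the premise that the two weights compose without an intervening channel-mixing operation are all hypothetical.

```python
import math
import torch

def ill_condition_pair(W_up: torch.Tensor, W_down: torch.Tensor, kappa: float = 1e4):
    """Reparameterize two weights whose composition is y = W_down @ (W_up @ x).

    In exact (or BF16) arithmetic the composed map is essentially unchanged,
    but each factor now spans several orders of magnitude across channels, so
    a per-tensor integer quantizer must either clip the large channels or
    round the small ones toward zero.
    """
    d = W_up.shape[0]  # shared hidden width of the pair
    # Per-channel scales spread log-uniformly from 1 up to kappa.
    s = torch.logspace(0.0, math.log10(kappa), d, dtype=W_up.dtype)
    # (W_down @ diag(1/s)) @ (diag(s) @ W_up) == W_down @ W_up exactly.
    W_up_enc = s.unsqueeze(1) * W_up              # scale rows of the first weight
    W_down_enc = W_down * (1.0 / s).unsqueeze(0)  # inverse-scale matching columns
    return W_up_enc, W_down_enc
```

Transforms (B) and (C) could presumably be built in a similar spirit, e.g. splitting a weight into summands that cancel only above a given precision, or planting a few extreme values that force a pathological quantization scale; the abstract gives only the high-level descriptions above.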
Submission Number: 60