Keywords: model quantization, fairness, benchmarking
TL;DR: Post-training quantization causes up to 38% of LLM responses to flip between biased and unbiased behavior, driven by model uncertainty rather than model size, creating hidden asymmetric impacts across social groups that standard aggregate metrics completely miss.
Abstract: Post-training quantization reduces the computational cost of large language models but fundamentally alters their social biases in ways that aggregate metrics fail to capture. We present the first large-scale study of 50 quantized models evaluated on **QuantizedBiasBench**, a unified benchmark of 13 closed- and open-ended bias datasets. Despite minimal changes in aggregate bias scores, we identify a phenomenon we term *quantization-induced behavior flipping*, in which up to 38% of responses switch between biased and unbiased behavior after quantization. These flips are strongly driven by model uncertainty: high-uncertainty responses are 3-11x more likely to change than confident ones. Quantization strength amplifies this effect, with 4-bit quantized models exhibiting 4-6x more behavioral changes than 8-bit quantized models. Critically, these changes create asymmetric impacts across demographic groups: bias can worsen by up to 18.6% for some groups while improving by 14.1% for others, yielding misleadingly neutral aggregate outcomes. Larger models show no consistent robustness advantage, and group-specific shifts vary unpredictably across model families. Our findings demonstrate that compression fundamentally alters bias patterns, making post-quantization bias evaluation essential for reliable deployment.
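As an illustrative sketch only (not the paper's implementation), the snippet below shows one plausible way to compute the quantities the abstract centers on: the overall flip rate between full-precision and quantized responses, flip rates stratified by an assumed uncertainty measure such as predictive entropy, and per-group bias shifts that can cancel into a misleadingly neutral aggregate. All function names, the entropy threshold, and the toy data are assumptions introduced for illustration.

```python
import numpy as np

def flip_rate(labels_fp, labels_q):
    """Fraction of responses whose biased/unbiased label changes after quantization."""
    labels_fp = np.asarray(labels_fp, dtype=bool)
    labels_q = np.asarray(labels_q, dtype=bool)
    return float(np.mean(labels_fp != labels_q))

def flip_rate_by_uncertainty(labels_fp, labels_q, entropies, threshold):
    """Flip rates for high- vs. low-uncertainty responses (uncertainty measure assumed,
    e.g. predictive entropy of the full-precision model)."""
    flips = np.asarray(labels_fp, dtype=bool) != np.asarray(labels_q, dtype=bool)
    high = np.asarray(entropies) >= threshold
    return float(flips[high].mean()), float(flips[~high].mean())

def group_bias_shift(labels_fp, labels_q, groups):
    """Per-group change in biased-response rate after quantization; the unweighted mean
    can sit near zero even when individual groups shift in opposite directions."""
    labels_fp = np.asarray(labels_fp, dtype=bool)
    labels_q = np.asarray(labels_q, dtype=bool)
    groups = np.asarray(groups)
    shifts = {}
    for g in np.unique(groups):
        mask = groups == g
        shifts[str(g)] = float(labels_q[mask].mean() - labels_fp[mask].mean())
    aggregate = float(np.mean(list(shifts.values())))
    return shifts, aggregate

# Toy usage with hypothetical labels (True = response judged biased):
fp  = [False, True, False, False, True, False]   # full-precision model
q4  = [True,  True, False, True,  False, False]  # 4-bit quantized model
ent = [2.1, 0.3, 0.2, 1.8, 1.9, 0.1]             # assumed uncertainty per response
grp = ["A", "A", "A", "B", "B", "B"]             # demographic group per response
print(flip_rate(fp, q4))                          # overall flip rate
print(flip_rate_by_uncertainty(fp, q4, ent, 1.0)) # (high-uncertainty, low-uncertainty)
print(group_bias_shift(fp, q4, grp))              # per-group shifts vs. aggregate
```

In this toy example the high-uncertainty responses account for all of the flips while the low-uncertainty ones are stable, and one group's biased-response rate shifts while the other's does not, so the aggregate understates the group-level change.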
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 22111