Keywords: model quantization, fairness, benchmarking
TL;DR: Post-training quantization causes up to 21% of LLM responses to flip between biased and unbiased, driven by model uncertainty rather than size, creating hidden asymmetric impacts across social groups that standard evaluation metrics completely miss.
Abstract: Aggregate bias metrics are fundamentally misleading for quantized large language models: quantization causes up to 21% of individual responses to flip between biased and unbiased states while aggregate scores remain unchanged, creating invisible harms that standard evaluations fail to detect. To reveal this hidden phenomenon, we introduce **PostTrainingBiasBench**, a unified framework for rigorous bias evaluation, and conduct the first large-scale study of 50 quantized models across 13 closed- and open-ended benchmarks. We find these flips are strongly linked to model uncertainty: uncertain responses are 3-11x more likely to change than confident ones. Through controlled intervention via preference optimization, we establish causal evidence that uncertainty drives response flipping. Quantization strength amplifies the effect (4-bit quantized models show 4-6x more behavioral changes than 8-bit models). Critically, these shifts asymmetrically impact demographic groups, with bias worsening by up to 18.6% for some groups while improving by 14.1% for others, yielding misleadingly neutral aggregate outcomes. Larger models show no consistent robustness advantage, and group-specific shifts vary unpredictably across model families. Together, our findings demonstrate that compression fundamentally reshapes bias patterns, underscoring the need for rigorous post-quantization evaluation and interventions to ensure reliability in practice.
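The core measurement point, that per-response flips and per-group shifts can cancel out in an aggregate bias score, can be illustrated with a toy calculation. The sketch below is not the paper's PostTrainingBiasBench code; the record fields (`group`, `biased_full`, `biased_quant`) and the judged-bias labels are hypothetical, assuming each prompt's response is labeled biased or unbiased before and after quantization.

```python
# Illustrative sketch: flip rate and per-group bias shifts vs. the aggregate score.
# Field names (group, biased_full, biased_quant) are hypothetical placeholders.
from collections import defaultdict

# Each record: was the response to a prompt judged biased at full precision / after quantization?
records = [
    {"group": "A", "biased_full": True,  "biased_quant": False},  # improves for group A
    {"group": "A", "biased_full": True,  "biased_quant": True},
    {"group": "B", "biased_full": False, "biased_quant": True},   # worsens for group B
    {"group": "B", "biased_full": False, "biased_quant": False},
]

# Fraction of individual responses whose bias label changed after quantization.
flips = sum(r["biased_full"] != r["biased_quant"] for r in records)
flip_rate = flips / len(records)

# Aggregate bias rate before vs. after: identical here despite a nonzero flip rate.
agg_full = sum(r["biased_full"] for r in records) / len(records)
agg_quant = sum(r["biased_quant"] for r in records) / len(records)

# Per-group rates expose the asymmetric shifts the aggregate hides.
by_group = defaultdict(lambda: {"full": 0, "quant": 0, "n": 0})
for r in records:
    g = by_group[r["group"]]
    g["full"] += r["biased_full"]
    g["quant"] += r["biased_quant"]
    g["n"] += 1

print(f"flip rate: {flip_rate:.0%}, aggregate bias: {agg_full:.0%} -> {agg_quant:.0%}")
for name, g in by_group.items():
    print(f"group {name}: bias {g['full']/g['n']:.0%} -> {g['quant']/g['n']:.0%}")
```

On this toy data the aggregate bias rate stays at 50% before and after quantization, yet half of the responses flip and the two groups move in opposite directions, which is the pattern of misleadingly neutral aggregates the abstract describes.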
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 22111