Calibration Collapse Under Sycophancy Fine-Tuning: How Reward Hacking Breaks Uncertainty Quantification in LLMs
Keywords: calibration, uncertainty quantification, sycophancy, RLHF, reward hacking, GRPO, Expected Calibration Error (ECE)
TL;DR: Sycophancy-inducing GRPO fine-tuning of Qwen3-8B yields a consistent directional rise in calibration error on MMLU (ECE +0.006 over the base model), though the effect is not statistically significant (p = 0.41) at this training budget; the sycophantic model also keeps a higher ECE than neutral SFT after post-hoc matrix scaling.
Abstract: Modern large language models (LLMs) are increasingly fine-tuned via
reinforcement learning from human feedback (RLHF) or related reward
optimisation schemes. While such procedures improve perceived helpfulness, we
investigate whether sycophantic reward signals degrade calibration---a property
essential for reliable uncertainty quantification. We compare Qwen3-8B under
three regimes: no fine-tuning (base), neutral supervised fine-tuning (SFT) on
TriviaQA, and sycophancy-inducing Group Relative Policy Optimisation (GRPO)
that rewards agreement with planted wrong answers. Evaluating on $1{,}000$
MMLU items across five subject domains with bootstrap confidence intervals and
permutation testing, we find that \textbf{sycophantic GRPO produces consistent
directional calibration degradation}---expected calibration error (ECE) rises by $+0.006$ relative to the
base model and maximum calibration error (MCE) increases by $+0.010$ relative to neutral SFT---though the
effect does not reach statistical significance ($p = 0.41$) at this training
budget. Post-hoc matrix scaling applied to all three models reduces ECE by
$40$--$64\%$ and improves accuracy by $1.5$--$3.0$ percentage points.
However, the sycophantic model retains a higher post-scaling ECE than
the neutral SFT control ($0.042$ vs.\ $0.037$), suggesting that
reward-induced miscalibration leaves a structured residual even after affine
correction. These findings establish a methodology for evaluating the
calibration impact of reward hacking and motivate calibration-aware
training objectives.
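
For readers unfamiliar with the metrics named in the abstract, the following is a minimal sketch of how ECE, MCE, and a bootstrap confidence interval for ECE could be computed from per-item confidences and correctness labels. The abstract does not state the binning scheme, the number of bins, or the bootstrap variant, so the equal-width bins, `n_bins=10`, the percentile bootstrap, and the function names `calibration_errors` and `bootstrap_ece_ci` are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def calibration_errors(confidences, correct, n_bins=10):
    """Return (ECE, MCE) using equal-width confidence bins (assumed scheme)."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Bin b covers (edges[b], edges[b + 1]]; values at or below edges[1] fall in bin 0.
    bin_ids = np.digitize(confidences, edges[1:-1], right=True)
    ece, mce = 0.0, 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue  # empty bins contribute nothing
        # Gap between empirical accuracy and mean stated confidence in this bin.
        gap = abs(correct[mask].mean() - confidences[mask].mean())
        ece += mask.mean() * gap  # weight the gap by the bin's share of items
        mce = max(mce, gap)       # MCE is the worst gap over non-empty bins
    return ece, mce


def bootstrap_ece_ci(confidences, correct, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for ECE (assumed variant), resampling items."""
    rng = np.random.default_rng(seed)
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    n = len(confidences)
    stats = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)  # sample item indices with replacement
        stats[i] = calibration_errors(confidences[idx], correct[idx])[0]
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)
```

Calling `calibration_errors(conf, correct)` on each model's predictions for the same evaluation items gives the per-model ECE and MCE; the differences reported in the abstract would then correspond to subtracting these values between models.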
Submission Number: 35