Evaluating Risks in Weak-to-Strong Alignment: A Bias-Variance Perspective

Hamid Osooli; Kareema Batool; Rick Gentry; Tiasa Singha Roy; Ashwin Gupta; Anirudha Ramesh

06 Feb 2026 (modified: 14 Apr 2026). Submitted to AFAA 2026. License: CC BY 4.0

Track: Main Papers Track (6 to 9 pages)

Keywords: weak-to-strong alignment, misfit-based risk bounds, bias-variance-covariance decomposition, deception in preference learning

TL;DR: We show that deception in weak-to-strong alignment is driven by covariance between weak and strong models, not bias or variance alone, and that reinforcement learning on the strong model reduces deception by breaking this covariance.

Abstract: Weak-to-strong alignment has emerged as a central paradigm for scalable supervision, yet it introduces new risks when strong models are trained using feedback generated by imperfect teachers. In this work, we analyze weak-to-strong alignment through a bias-variance perspective by connecting misfit theory to practical post-training pipelines. We derive a misfit-based upper bound on the strong model’s population risk under weak supervision and decompose this bound into bias, variance, and covariance components that capture both teacher quality and student deviation. We empirically study four weak-to-strong training pipelines spanning supervised fine-tuning (SFT), reinforcement learning from human feedback (RLHF), and reinforcement learning from AI feedback (RLAIF) on the PKU-SafeRLHF and HH-RLHF datasets. Our results show that bias and variance alone are insufficient to explain deceptive behavior; instead, covariance alignment between weak and strong reward models plays a dominant role. In particular, supervised fine-tuning tends to preserve low-variance alignment but can amplify weak-model inductive biases, whereas reinforcement learning applied to the strong model suppresses deception by disrupting covariance alignment, even when theoretical alignment bounds increase. These findings highlight fundamental limitations of misfit-based bounds as standalone safety indicators and emphasize the importance of controlling weak-strong interactions in alignment pipelines.

Submission Number: 54
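
The bias, variance, and covariance terms named in the abstract can be illustrated with the textbook identity for the mean squared gap between two score vectors: E[(s - w)^2] = (E[s] - E[w])^2 + Var(s) + Var(w) - 2 Cov(s, w). The sketch below is only that generic identity, not the paper's misfit bound. The function name `misfit_decomposition`, the synthetic scores, and the interpretation of `weak` and `strong` as reward-model outputs are assumptions made for illustration.

```python
import numpy as np


def misfit_decomposition(weak, strong):
    """Split the mean squared gap between two score vectors into
    squared bias, the two variances, and their covariance.

    Hypothetical helper for illustration; not the paper's bound.
    """
    weak = np.asarray(weak, dtype=float)
    strong = np.asarray(strong, dtype=float)

    # Squared difference of the mean scores (the "bias" term).
    bias_sq = (strong.mean() - weak.mean()) ** 2
    # Population variances (ddof=0) so the identity holds exactly.
    var_weak = weak.var()
    var_strong = strong.var()
    # Population covariance between weak and strong scores.
    cov = np.mean((weak - weak.mean()) * (strong - strong.mean()))
    # Mean squared gap between strong and weak scores.
    misfit = np.mean((strong - weak) ** 2)

    return {
        "bias_sq": bias_sq,
        "var_weak": var_weak,
        "var_strong": var_strong,
        "cov": cov,
        "misfit": misfit,
    }


# Synthetic scores (assumed data): the strong scores partly track the weak ones.
rng = np.random.default_rng(0)
weak = rng.normal(0.0, 1.0, size=1000)
strong = 0.8 * weak + rng.normal(0.2, 0.5, size=1000)

parts = misfit_decomposition(weak, strong)
recombined = (
    parts["bias_sq"]
    + parts["var_weak"]
    + parts["var_strong"]
    - 2.0 * parts["cov"]
)
print(parts)
print("recombined:", recombined)
```

In this identity the covariance enters with a negative sign, so two score vectors with fixed means and variances can sit close together or far apart depending only on how strongly they co-vary. That is the sense in which bias and variance alone do not pin down the gap.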
