The Alignment Bottleneck

18 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: AI Alignment, Large Language Model, Reinforcement Learning From Human Feedback, Information Theory, PAC-Bayes
TL;DR: We reframe LLM alignment as an information bottleneck problem, showing that the limited capacity of human feedback imposes hard theoretical limits on performance and explains phenomena like reward hacking.
Abstract: We study alignment from feedback for large language models under a finite information budget. The feedback loop is modeled as a two-stage channel $U \to H \to Y$ given context $S$, where $U$ is the alignment target, $H$ is the rater's bounded judgment, and $Y$ is the observed label. The average capacity $\bar C_{\mathrm{tot}\mid S}$ of this channel constitutes an alignment bottleneck. Applying Fano's inequality to separable codebooks yields a minimax lower bound on alignment error that depends on the value complexity $\log M$ and the capacity but is independent of dataset size; hence scaling data cannot eliminate error when the feedback channel is structurally deficient. We further show that the same capacity term controls the environmental budget in a PAC-Bayes generalization bound. Together, these results define a performance interval beyond which further optimization fits rater artifacts such as sycophancy. Experiments with Qwen models confirm that low-capacity feedback leads to saturation and then degradation even as data scales. Our framework suggests that improving alignment requires increasing channel capacity, for instance through richer feedback interfaces or clearer constitutions, rather than merely collecting more data.
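For intuition, a schematic Fano-style statement consistent with the abstract (a sketch only; the paper's precise statement, constants, and conditioning, including how separability removes the dataset-size dependence, may differ): if a uniformly drawn target value $V$ ranges over a separable codebook of size $M$ and any estimator $\hat V$ observes the target only through the channel $U \to H \to Y$, then

$$\Pr[\hat V \neq V] \;\ge\; 1 - \frac{\bar C_{\mathrm{tot}\mid S} + \log 2}{\log M}.$$

Because the right-hand side involves only the channel capacity and the value complexity $\log M$, the error floor persists regardless of how many labels are collected.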
Primary Area: learning theory
Submission Number: 13036