Position: The Complexity of Perfect AI Alignment - Formalizing the RLHF Trilemma

Published: 08 Nov 2025, Last Modified: 23 Nov 2025, ResponsibleFM @ NeurIPS 2025, CC BY 4.0
Keywords: RLHF, AI Alignment, Democratic Representation
TL;DR: No RLHF system can simultaneously be fully representative, computationally efficient, and robust; at most two of the three can be achieved at once.
Abstract: Reinforcement Learning from Human Feedback (RLHF) has become the dominant approach for aligning large language models, yet practitioners face a persistent puzzle: improving safety often reduces fairness, scaling to diverse populations becomes computationally intractable, and making systems robust often amplifies majority biases. We formalize this tension as the **Alignment Trilemma**: no RLHF system can simultaneously achieve (i) $\varepsilon$-representativeness across diverse human values, (ii) polynomial tractability in sample and compute complexity, and (iii) $\delta$-robustness against adversarial perturbations and distribution shift. Through a complexity-theoretic analysis integrating statistical learning theory and robust optimization, we prove that achieving both representativeness ($\varepsilon \leq 0.01$) and robustness ($\delta \leq 0.001$) for global-scale populations requires $\Omega(2^{d_{\text{context}}})$ operations, which is super-polynomial in the context dimensionality. We demonstrate that current RLHF implementations resolve this trilemma by sacrificing representativeness: they collect only $10^3$–$10^4$ samples from homogeneous annotator pools, whereas true global representation would require $10^7$–$10^8$ samples. Our framework provides a unifying explanation for documented RLHF pathologies including preference collapse, sycophancy, and systematic bias amplification. We conclude with concrete directions for navigating these fundamental trade-offs through strategic relaxations of alignment requirements.
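The trilemma claimed in the abstract can be written schematically. The following is a minimal LaTeX sketch of that statement; the notation ($\mathcal{A}$, $\operatorname{Rep}$, $\operatorname{Cost}$, $\operatorname{Rob}$) is hypothetical shorthand for illustration and is not taken from the paper's formal definitions.

```latex
% Minimal sketch of the trilemma statement as summarized in the abstract.
% The symbols A, Rep, Cost, Rob are assumed shorthand, not the paper's own formalism.
\documentclass{article}
\usepackage{amsmath}
\begin{document}
No RLHF alignment procedure $\mathcal{A}$ is claimed to satisfy all three properties at once:
\[
  \underbrace{\operatorname{Rep}(\mathcal{A}) \le \varepsilon}_{\text{(i) representativeness}}
  \;\wedge\;
  \underbrace{\operatorname{Cost}(\mathcal{A}) \in \operatorname{poly}(d_{\text{context}})}_{\text{(ii) tractability}}
  \;\wedge\;
  \underbrace{\operatorname{Rob}(\mathcal{A}) \le \delta}_{\text{(iii) robustness}}.
\]
In particular, the abstract asserts that enforcing (i) with $\varepsilon \le 0.01$ and
(iii) with $\delta \le 0.001$ at global scale costs
$\Omega\bigl(2^{d_{\text{context}}}\bigr)$ operations, which is incompatible with (ii).
\end{document}
```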
Submission Number: 132