The Alignment Trilemma: A Theoretical Perspective on Recursive Misalignment and Human-AI Adaptation Dynamics

Published: 06 Mar 2025, Last Modified: 05 May 2025 | ICLR 2025 Bi-Align Workshop Poster | CC BY 4.0
Keywords: alignment trilemma, recursive misalignment, game theory, replicator dynamics, Nash equilibrium, adaptation dynamics
TL;DR: We show that AI alignment involves three competing objectives that form a mathematical trilemma, and use game theory to prove that a mixed-strategy equilibrium outperforms pure optimization of any single objective.
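
As a point of reference for the game-theoretic claim above, the sketch below simulates standard replicator dynamics on a hypothetical 3x3 payoff matrix over the three objectives named in the abstract. The payoff values, step size, and starting point are illustrative assumptions and are not taken from the paper; they are chosen only so that the population drifts away from any pure corner toward an interior mixed strategy.

```python
import numpy as np

# Hypothetical payoffs over the three objectives
# (direct alignment, capability preservation, meta-alignment).
# The cyclic structure is an illustrative assumption, not the paper's model.
A = np.array([
    [1.0, 2.0, 0.5],
    [0.5, 1.0, 2.0],
    [2.0, 0.5, 1.0],
])

def replicator_step(x, A, dt=0.01):
    """One Euler step of the replicator dynamics dx_i/dt = x_i * ((Ax)_i - x.Ax)."""
    fitness = A @ x          # payoff of each pure strategy against the current mix
    average = x @ fitness    # population-average payoff
    return x + dt * x * (fitness - average)

x = np.array([0.8, 0.1, 0.1])       # start near the pure "direct alignment" corner
for _ in range(20_000):
    x = replicator_step(x, A)
print("long-run mix:", x.round(3))  # approaches the interior mixed strategy (1/3, 1/3, 1/3)
```

With these assumed cyclic payoffs, each pure strategy is exploitable by another, so the dynamics settle on a mixed interior point rather than any single-objective corner, which is the qualitative behavior the TL;DR describes.
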
Abstract: We introduce the \emph{Alignment Trilemma} as a theoretical framework to explain the recursive misalignment observed in contemporary AI alignment methods. Our formulation decomposes misalignment into three interdependent components---direct alignment, capability preservation, and meta-alignment---whose conflicting optimization can trigger cycles of drift. In light of recent work on human-AI adaptation dynamics \citep{Shen2024Bidirectional, Carroll2024DRMDP, Harland2024MORL} and adaptive teaming architectures \citep{Ni2021Adaptive, Mahmood2024Behavior}, we propose a holistic approach that includes a novel metric, the \emph{Alignment Performance Score (APS)}, which captures the overall quality of alignment across these three dimensions. Our insights aim to guide the development of AI systems that co-evolve safely with human partners.
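
The abstract introduces the APS, but this page does not reproduce its definition. The sketch below is a minimal, hypothetical aggregation, assuming each of the three dimensions is scored in [0, 1] and combined by a weighted geometric mean; the function name, weights, and functional form are placeholders for illustration only, chosen because a geometric mean penalizes neglecting any single dimension, in line with the trilemma framing.

```python
import numpy as np

def alignment_performance_score(direct, capability, meta, weights=(1/3, 1/3, 1/3)):
    """Hypothetical APS aggregation (NOT the paper's definition): a weighted
    geometric mean of three per-dimension scores, each assumed to lie in [0, 1]."""
    scores = np.array([direct, capability, meta], dtype=float)
    w = np.array(weights, dtype=float) / np.sum(weights)
    return float(np.prod(scores ** w))

# Under this illustrative aggregation, maximizing one dimension at the expense
# of another scores worse overall than a balanced profile.
print(alignment_performance_score(0.95, 0.90, 0.20))  # ~0.55
print(alignment_performance_score(0.80, 0.80, 0.80))  # ~0.80
```
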
Submission Type: Short Paper (4 Pages)
Archival Option: This is an archival submission
Presentation Venue Preference: ICLR 2025
Submission Number: 28