Keywords: alignment, Bayesian persuasion, learning agents
Abstract: Aligning AI systems with human values remains a fundamental challenge, but does our inability to create perfectly aligned models preclude obtaining the benefits of alignment? We study a strategic setting where a human user interacts with multiple differently misaligned AI agents, none of which is individually well-aligned. Our key insight is that when the user's utility lies approximately within the convex hull of the agents' utilities, a condition that becomes easier to satisfy as model diversity increases, strategic competition can yield outcomes comparable to interacting with a perfectly aligned model. We model this as a multi-leader Stackelberg game, extending Bayesian persuasion to multi-round conversations between differently informed parties, and prove three results: (1) when perfect alignment would allow the user to learn her Bayes-optimal action, she can also do so in all equilibria under the convex hull condition; (2) under weaker assumptions requiring only approximate utility learning, a non-strategic user employing quantal response achieves near-optimal utility in all equilibria; and (3) when the user selects the best single AI after an evaluation period, equilibrium guarantees remain near-optimal without further distributional assumptions. We complement the theory with two forms of empirical evidence. First, we simulate the best-AI selection game using best-response dynamics; the simulations show that competition among individually misaligned agents reliably improves user utility when the approximate convex hull assumption is satisfied, but does not reliably do so when it fails. Second, we show that synthetically generated AI utility functions (produced via perturbations of the same prompt used to evaluate instances from a movie recommendation dataset (MovieLens) and an ethical judgement dataset (ETHICS)) quickly yield a convex hull containing a good approximation of a given utility function, even when none of the individual LLM utility functions is well-aligned.
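As a concrete illustration of the approximate convex hull condition described in the abstract, the following minimal Python sketch (not taken from the paper; the names convex_hull_gap, agent_utils, and user_util, and the toy utility vectors, are hypothetical) estimates how far a user's utility vector lies from the convex hull of a set of agent utility vectors defined over a common finite outcome space, using scipy's SLSQP solver. A gap close to zero for some mixture weights corresponds to the regime in which the paper's equilibrium guarantees are claimed to apply.

import numpy as np
from scipy.optimize import minimize

def convex_hull_gap(agent_utils, user_util):
    # agent_utils: (k, n) array, one utility vector per agent over n outcomes.
    # user_util:   (n,) array, the user's utility vector over the same outcomes.
    # Returns the Euclidean distance from user_util to the convex hull of the
    # agent utility vectors, together with the minimizing mixture weights.
    k = agent_utils.shape[0]

    def objective(w):
        # Squared distance between the weighted combination and the user's utility.
        return float(np.sum((agent_utils.T @ w - user_util) ** 2))

    w0 = np.full(k, 1.0 / k)  # start from the uniform mixture
    res = minimize(
        objective,
        w0,
        method="SLSQP",
        bounds=[(0.0, 1.0)] * k,
        constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}],
    )
    return float(np.sqrt(res.fun)), res.x

# Hypothetical example: three misaligned agents, none close to the user on its
# own, whose convex hull nonetheless (approximately) contains the user's utility.
agents = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0]])
user = np.array([0.4, 0.35, 0.25])
gap, weights = convex_hull_gap(agents, user)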
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 13145