AI Alignment Via Power Mean Elicitation

Published: 19 Dec 2025, Last Modified: 05 Jan 2026 · AAMAS 2026 Extended Abstract · CC BY 4.0
Keywords: AI Alignment, Preference Elicitation, Power means
TL;DR: We approximate the underlying welfare concept held by an LLM via an algorithmic framework with provable guarantees, and evaluate our methodology on a number of LLMs.
Abstract: Humans and AI make decisions based on their innate/encoded social values. However, the sheer complexity of human/AI decision-making processes makes it exceedingly difficult to meaningfully elicit social values. We propose an algorithmic elicitation framework that approximates an unknown welfare concept. More specifically, we devise algorithms that ask agents which of two social outcomes they prefer, and use their answers to recover a power-mean welfare function that closely approximates the agents' (unknown) welfare concept. We show that only $\Theta\left(\log \frac{\log p}{\varepsilon}\right)$ comparison queries are needed to obtain an $\varepsilon$-approximate welfare concept, where $p$ is the (unknown) exponent of the power-mean welfare function we wish to elicit. We empirically evaluate our methods on a variety of large language models. Our analysis indicates that some LLMs exhibit utilitarian behavior, whereas others are more egalitarian.
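To make the comparison-query idea concrete, below is a minimal sketch of power-mean elicitation. It is not the paper's algorithm: the function names (`power_mean`, `discriminating_pair`, `elicit_p`), the two-agent outcomes, the bounded exponent range, and the plain bisection (rather than the doubling search that underlies the stated $\Theta\left(\log \frac{\log p}{\varepsilon}\right)$ bound) are all simplifying assumptions made for illustration; the `compare` oracle stands in for the queried agent or LLM.

```python
import numpy as np

def power_mean(utilities, p):
    """Power-mean welfare W_p(u) = (mean(u_i^p))^(1/p).
    p = 1 is utilitarian, p -> 0 is the Nash/geometric mean,
    and p -> -inf approaches egalitarian (min) welfare."""
    u = np.asarray(utilities, dtype=float)
    if abs(p) < 1e-12:                        # geometric mean as the p -> 0 limit
        return float(np.exp(np.mean(np.log(u))))
    return float(np.mean(u ** p) ** (1.0 / p))

def discriminating_pair(mid):
    """Two outcomes whose power-mean ranking flips at exponent `mid`.
    Outcome b = (1, 1) has welfare 1 for every p; outcome a is built so
    that W_mid(a) = 1.  Since W_p(a) is increasing in p, an agent prefers
    a over b exactly when its own exponent exceeds `mid`."""
    if abs(mid) < 1e-12:
        return (2.0, 0.5), (1.0, 1.0)         # geometric-mean tie
    y = (2.0 - 2.0 ** mid) ** (1.0 / mid)
    return (2.0, y), (1.0, 1.0)

def elicit_p(compare, lo=-8.0, hi=1.0, eps=1e-3):
    """Bisect over the exponent using comparison queries.
    `compare(a, b)` is the queried agent (e.g. an LLM prompted with two
    social outcomes) and returns True iff outcome `a` is preferred."""
    while hi - lo > eps:
        mid = (lo + hi) / 2.0
        a, b = discriminating_pair(mid)
        if compare(a, b):                     # unequal outcome preferred => p > mid
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Example: a simulated utilitarian agent; the elicited exponent converges
# toward p = 1, the utilitarian endpoint of the search range.
utilitarian = lambda a, b: power_mean(a, 1.0) > power_mean(b, 1.0)
print(round(elicit_p(utilitarian), 2))
```

Each query halves the interval of candidate exponents, which is why a logarithmic number of comparisons suffices in this simplified setting.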
Area: Game Theory and Economic Paradigms (GTEP)
Generative AI: I acknowledge that I have read and will follow this policy.
Submission Number: 17