Keywords: Preference Modeling, Value Alignment, Evaluation
TL;DR: We propose a scalable and efficient alignment method that emulates human preferences via a compact set of LLM agents.
Abstract: Large language models (LLMs) often collapse toward average responses, obscuring the diversity needed to model different population-level preferences. While prompting can steer models toward diverse responses, it remains a non-trivial challenge to use prompting to efficiently align with the preferences of a target population. We propose a new theoretical lens, preference reconstruction theory, which formalizes population preference alignment as the construction of a functional basis of proxy agents. We implement this via Prompts-to-Proxies (P2P), a framework for preference reconstruction that formulates alignment as a two-stage problem. First, we use structured prompting with entropy-based adaptive sampling to construct a diverse set of endowed agents, each representing a vector in the latent preference space. Second, we reconstruct the population preference by estimating sparse weights over these agents via L1-regularized regression, aligning the resulting aggregate response distribution with observed data. This yields a compact proxy population that captures both the scope and the distribution of preferences without demographic conditioning. P2P offers a cost-effective alternative to large-scale personalization and a principled testbed for studying pluralistic alignment. We validate the approach through an empirical evaluation on 14 waves of the American Trends Panel, demonstrating high-fidelity reconstruction, substantial diversity, and cross-domain generalization.
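To make the second stage concrete, the sketch below illustrates the kind of L1-regularized regression the abstract describes: given each agent's response distribution as a column, estimate sparse nonnegative weights so the weighted mixture matches an observed population distribution. This is a generic illustration, not the paper's implementation; the function name, the synthetic data, and the ISTA solver are our own assumptions.

```python
import numpy as np

def reconstruct_weights(A, y, lam=1e-4, n_iter=2000):
    """Estimate sparse nonnegative weights w such that A @ w ~ y.

    A   : (n_items, n_agents) matrix; column j is agent j's response
          distribution over the survey items (hypothetical stand-in
          for the endowed proxy agents).
    y   : (n_items,) observed population response distribution.
    lam : L1 penalty strength controlling sparsity of the proxy set.

    Solves min_w ||A w - y||^2 + lam * ||w||_1 subject to w >= 0
    via proximal gradient descent (ISTA), a simple generic solver
    for this kind of regression step.
    """
    n_agents = A.shape[1]
    w = np.zeros(n_agents)
    # Step size from the Lipschitz constant of the quadratic term.
    t = 1.0 / (np.linalg.norm(A, 2) ** 2)
    for _ in range(n_iter):
        grad = A.T @ (A @ w - y)          # gradient of the squared loss
        z = w - t * grad                  # gradient step
        w = np.maximum(z - t * lam, 0.0)  # nonnegative soft-thresholding
    return w

# Synthetic check: a population that is a sparse mixture of 3 of 8 agents.
rng = np.random.default_rng(0)
A = rng.dirichlet(np.ones(20), size=8).T   # (20 items, 8 agents)
w_true = np.zeros(8)
w_true[[1, 4, 6]] = [0.5, 0.3, 0.2]
y = A @ w_true
w_hat = reconstruct_weights(A, y)
residual = np.linalg.norm(A @ w_hat - y)
```

With a small penalty the recovered weights concentrate on the few agents that actually generated the data, mirroring the "compact proxy population" the abstract describes.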
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 10236