To Whom Do Language Models Align? Measuring Principal Hierarchies Under High-Stakes Competing Demands
Keywords: pluralistic alignment, AI safety, evaluation, high-stakes decision making
Abstract: Language models deployed in high-stakes professional settings face a pluralistic alignment problem when users, institutional authorities, and professional standards issue competing demands. How a model resolves such conflicts reveals an implicit principal hierarchy—an ordering over stakeholders that determines, for instance, whether a medical AI follows a hospital administrator's cost-reduction directive or refuses on evidence-based grounds. Across 7,136 scenarios in legal and medical domains, we evaluate ten frontier models and find that their hierarchies are unstable: models uphold professional standards on advisory questions but frequently fail to do so on execution requests (e.g., drafting) with identical content; user-versus-authority orderings differ between medicine and law; and patterns diverge across model families. The dominant failure mechanism is knowledge omission—harmful output produced without surfacing facts the model demonstrably possesses. In a particularly troubling instance, a reasoning model flags a drug as withdrawn in its reasoning trace yet suppresses this fact and recommends the drug under authority pressure. Inconsistent behavior across task framing, domain, and model family suggests that current alignment methods, including published hierarchy specifications, are unlikely to be robust when models are deployed in high-stakes professional settings.
Submission Number: 61