Moral Preferences of LLMs Under Directed Contextual Influence

Published: 02 Mar 2026, Last Modified: 05 Mar 2026 · ICLR 2026 Trustworthy AI · CC BY 4.0
Keywords: large language models (LLMs), bias evaluation, moral decision-making, alignment, contextual influence
TL;DR: We pose trolley-style moral dilemmas to LLMs with added contextual pressure toward either option and characterize the effect of this context on their preferences.
Abstract: Moral benchmarks for LLMs typically use context-free prompts, implicitly assuming stable preferences. In deployment, however, prompts routinely include contextual signals, such as user requests or cues about social norms, that may steer decisions. We study how directed contextual influences reshape decisions in trolley-problem-style moral triage settings, introducing a pilot evaluation harness: for each demographic factor, we apply matched, direction-flipped contextual influences that differ only in which group they favor, enabling systematic measurement of directional response. We find that (i) contextual influences often significantly shift decisions, even when only superficially relevant; (ii) baseline preferences are a poor predictor of directional steerability, as models can appear baseline-neutral yet exhibit systematic steerability asymmetry under influence; (iii) influences sometimes backfire, producing non-monotonic behavior; and (iv) reasoning reduces average sensitivity but amplifies the effect of biased few-shot examples. Our findings motivate extending moral evaluations with controlled, direction-flipped context manipulations to better characterize model behavior.
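To illustrate the direction-flipped protocol, here is a minimal Python sketch of one matched manipulation; the query_model callable, dilemma text, and influence wording are hypothetical stand-ins for illustration, not the paper's actual harness or materials.

    # Minimal sketch of a matched, direction-flipped context manipulation.
    # Assumes a generic `query_model` callable returning P(save Group A);
    # all prompt text here is illustrative, not the paper's materials.

    from typing import Callable

    BASE_DILEMMA = (
        "A runaway trolley will hit one of two groups. You can divert it. "
        "Group A: {group_a}. Group B: {group_b}. Which group do you save?"
    )

    # The two influences are identical except for which group they favor.
    INFLUENCE_TEMPLATE = "Bystanders urge you to save {favored}."

    def directional_shift(
        query_model: Callable[[str], float],  # returns P(save Group A)
        group_a: str,
        group_b: str,
    ) -> dict:
        base = BASE_DILEMMA.format(group_a=group_a, group_b=group_b)

        p_baseline = query_model(base)  # context-free baseline
        p_toward_a = query_model(base + " " + INFLUENCE_TEMPLATE.format(favored="Group A"))
        p_toward_b = query_model(base + " " + INFLUENCE_TEMPLATE.format(favored="Group B"))

        return {
            "baseline": p_baseline,
            "shift_toward_a": p_toward_a - p_baseline,  # positive = influence worked
            "shift_toward_b": p_baseline - p_toward_b,
            # Nonzero asymmetry can occur even when the baseline is neutral
            # (finding ii): the model steers more easily in one direction.
            "steerability_asymmetry": (p_toward_a - p_baseline) - (p_baseline - p_toward_b),
        }

Because the two influenced prompts differ only in the favored group, any difference between the two shifts isolates directional steerability from baseline preference.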
Submission Number: 96