The Linear Geometry of Moral Choice in LLMs

ICLR 2026 Conference Submission 23601 Authors

20 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: moral decision-making, large language models, chain-of-thought reasoning, representation geometry, concept directions, interpretability
Abstract: Moral dilemmas -- short scenarios that impose harm-benefit trade-offs -- exhibit strong framing effects in large language models (LLMs). We show these effects concentrate along a single latent moral direction in hidden states that separates impersonal (observer) from personal (actor) framings. Projection onto this direction predicts baseline yes/no choices, and small controlled steps along it steer decisions while preserving fluency and topical content. Comparing models with the same base architecture, reasoning-tuned variants tend to be more utilitarian and less sensitive to the personal/impersonal distinction, with decisions varying more smoothly as we move along the axis; non-reasoning variants place greater weight on framing. The pattern is robust across alternative geometric constructions, layers, and evaluation windows. We release a transparent evaluation protocol (decision extraction, flip-rate curves, drift audits) and open artifacts. Together, these results provide an interpretable, auditable link between representation geometry and normative choices in LLMs.
Supplementary Material: zip
Primary Area: interpretability and explainable AI
Submission Number: 23601
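
The core geometric construction described in the abstract (a single direction separating personal from impersonal framings, used both as a decision read-out and as a steering axis) can be sketched roughly as follows. This is a minimal illustration with synthetic hidden states, not the authors' released protocol; the difference-of-means construction, the zero threshold, and all variable names are assumptions for exposition only.

# Minimal sketch: difference-of-means "moral direction", projection read-out,
# and additive steering of hidden states. Synthetic data stands in for real
# LLM hidden states at the decision token.
import numpy as np

rng = np.random.default_rng(0)
d = 512  # hidden size (placeholder)

# In practice these would be layer-l hidden states for personal (actor) vs.
# impersonal (observer) framings of the same dilemmas.
h_personal = rng.normal(0.0, 1.0, size=(200, d)) + 0.5
h_impersonal = rng.normal(0.0, 1.0, size=(200, d)) - 0.5

# 1) Concept direction: difference of class means, unit-normalised.
v = h_personal.mean(axis=0) - h_impersonal.mean(axis=0)
v /= np.linalg.norm(v)

# 2) Read-out: project a hidden state onto v and threshold the scalar.
def predict_choice(h, threshold=0.0):
    """Sign of the projection onto the moral direction gives the predicted yes/no."""
    return float(h @ v) > threshold

# 3) Steering: take a small controlled step along v (alpha can be negative).
def steer(h, alpha):
    """Shift a hidden state by alpha along the moral direction."""
    return h + alpha * v

h_test = h_personal[0]
print(predict_choice(h_test), predict_choice(steer(h_test, -3.0)))

In a real pipeline the shifted hidden state would be written back into the forward pass (e.g. via a hook at the chosen layer) and the flip rate measured as a function of alpha, which is what a flip-rate curve tracks.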