Keywords: Large Language Models (LLMs), Moral Judgment, Attention Mechanisms, AI Alignment, Ethical Dilemmas
Abstract: We study how large language models (LLMs) differ in moral judgment when prompted for fast, intuition-like answers versus explicit reasoning. Across taboo dilemmas, trolley problems, and AI principle-conflict scenarios, non-reasoning models align more closely with human intuitions, while reasoning-enabled models tend to favor consequentialist or rule-based choices, sometimes overriding autonomy and privacy. This work makes two contributions: (i) a controlled evaluation framework for isolating the behavioral effects of reasoning in LLMs, and (ii) empirical evidence that reasoning capabilities can induce normative shifts misaligned with human values. These results highlight a structural tension in model alignment: as LLMs become more capable of reasoning, they may not converge toward human-like ethics but may instead follow paths that abstract away moral intuitions. This raises critical questions for the design of safe and aligned artificial general intelligence.
Submission Number: 96