Keywords: Large language models, moral reasoning, ethical AI, moral sensitivity, vignette design, benchmark evaluation, moral salience, human-AI comparison, ethical decision-making, moral dimensions, feature identification
TL;DR: Current benchmarks overestimate LLMs' moral reasoning capabilities by eliminating the crucial first step of identifying morally relevant features from noisy information.
Abstract: Moral competence is the ability to act in accordance with moral principles. As large language models (LLMs) are increasingly deployed in situations demanding moral competence, there is growing interest in evaluating this ability empirically.
We review the existing literature and identify three significant shortcomings: (i) over-reliance on pre-packaged moral scenarios with explicitly highlighted moral features; (ii) a focus on verdict prediction rather than moral reasoning; and (iii) inadequate testing of models' ability to recognize when additional information is needed.
Grounded in philosophical research on moral skills, we then introduce a novel methodology for assessing moral competence in LLMs that addresses these shortcomings.
Our approach moves beyond simple verdict comparisons to evaluate five distinct dimensions of moral competence: identifying morally relevant features, weighting their importance, assigning moral reasons to these features, synthesizing coherent moral judgments, and recognizing information gaps. We conduct two experiments comparing six leading LLMs against both non-expert humans and professional philosophers. In our first experiment, which used ethical vignettes standard in existing work, LLMs generally outperformed non-expert humans across multiple dimensions of moral reasoning. However, our second experiment, featuring novel scenarios specifically designed to test moral sensitivity by embedding relevant features among irrelevant details, revealed a striking reversal: several LLMs performed significantly worse than humans at identifying morally salient features.
Our findings suggest that current evaluations may substantially overestimate LLMs' moral reasoning capabilities by eliminating the crucial task of discerning moral relevance from noisy information, which we take to be a prerequisite for genuine moral skill. This work provides a more nuanced framework for assessing AI moral competence and highlights important directions for improving (assessment of) ethical reasoning capabilities in advanced AI systems.
Submission Number: 4