Abstract: As large language models (LLMs) become integral to everyday decision-making, understanding their moral reasoning capabilities is increasingly critical. In this study, we present a finding essential to the responsible development of AI: \textit{LLMs often fail to engage in genuine moral reasoning and are alarmingly vulnerable to prompt injection manipulations} that can shift their ethical stance, with success rates between 21\% and 97\%. To systematically evaluate this vulnerability, we introduce the Immorality Leaning Gap, a novel benchmark that quantifies the extent to which language models exhibit a bias toward immoral scenarios regardless of the actions or outcomes involved. We examined the potential of LLMs to align with normative ethical standards and found that, while they can reflect shared moral norms, they are highly susceptible to prompt manipulation. These findings reveal a critical vulnerability in current AI systems and mark a key step toward developing more ethically robust models.
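The abstract does not spell out how the Immorality Leaning Gap is computed. As a rough illustration only, a gap metric of this kind could be defined as the model's mean preference for the immoral option over the moral one across paired scenarios; the `Scenario` structure, the `score_option` callable, and the averaging scheme below are hypothetical assumptions, not the paper's actual method.

```python
# Hypothetical sketch of an "Immorality Leaning Gap"-style metric. It assumes
# the benchmark presents paired moral/immoral options for a shared scenario
# and that we can score each option with the model (e.g., via its log-prob).
# All names and the exact gap definition are illustrative assumptions.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Scenario:
    context: str         # shared scenario description
    moral_option: str    # the normatively acceptable action
    immoral_option: str  # the normatively unacceptable action


def immorality_leaning_gap(
    scenarios: List[Scenario],
    score_option: Callable[[str, str], float],  # assumed model scorer: (context, option) -> score
) -> float:
    """Mean preference for the immoral option over the moral one.

    A positive value indicates a lean toward immoral choices;
    zero indicates no systematic preference.
    """
    gaps = [
        score_option(s.context, s.immoral_option)
        - score_option(s.context, s.moral_option)
        for s in scenarios
    ]
    return sum(gaps) / len(gaps)
```

In this sketch, `score_option` would wrap whatever scoring the benchmark actually uses (option log-likelihood, a forced-choice answer probability, etc.); the gap is simply averaged over scenarios, though the paper may weight or normalize differently.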
Paper Type: Short
Research Area: Ethics, Bias, and Fairness
Research Area Keywords: Ethics, Moral Reasoning, Ethical LLM, Ethical AI, Bias, Fairness, Norm, Safety, Machine ethics, AI safety
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Submission Number: 6807