Diagnosing Moral Reasoning: A Benchmark for Evaluating Consistency and Robustness in Large Language Models
Keywords: AI Ethics, Reasoning Consistency, Moral Reasoning, Benchmarking, Large Language Models, AI Alignment, Prompt Steerability, Logical Robustness
TL;DR: LLMs exhibit critical deficits in logical and ethical consistency. We introduce a benchmark that quantifies these reasoning failures, revealing urgent risks for reliable AI.
Abstract: Despite their impressive task generalization, the logical robustness of large language models (LLMs) in complex reasoning domains remains poorly understood. We introduce a novel benchmark to evaluate a critical facet of reasoning: ethical consistency. Our framework probes models with moral dilemmas augmented by clarifying and contradictory follow-ups, extracting concrete yes/no responses to enable rigorous analysis. We propose two diagnostic metrics: an Ethical Consistency Index (ECI) that quantifies logical contradictions across scenarios, and an entropy-based score that measures response stochasticity. Evaluating state-of-the-art models against human baselines, we find that LLMs exhibit significant reasoning deficits, achieving only middling consistency. Furthermore, we demonstrate that ethical stance is highly steerable and context-dependent, revealing a lack of robust principles. These results highlight urgent risks for high-stakes deployment and underscore the need for benchmarks that move beyond capability checking to diagnose reasoning processes. We open-source our benchmark (https://anonymous.4open.science/r/TrolleyBench-FD46/README.md) to advance the development of more logically consistent and reliable models.
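To make the two diagnostic metrics concrete, the sketch below shows one plausible instantiation: Shannon entropy over repeated yes/no answers as a stochasticity score, and a pairwise agreement rate as a consistency index. The function names, the exact formulas, and the pairing scheme are illustrative assumptions, not the paper's definitions of the ECI or its entropy score.

```python
import math
from collections import Counter

def stochasticity_score(responses):
    """Shannon entropy (in bits) of repeated yes/no answers to the same
    dilemma. 0.0 = fully deterministic, 1.0 = coin-flip behavior.
    Illustrative sketch; not the paper's exact metric."""
    counts = Counter(responses)
    total = len(responses)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def consistency_index(base_answers, followup_answers, expected_flip):
    """Fraction of scenario pairs whose answers are logically consistent:
    a contradictory follow-up should flip the stance, a clarifying one
    should preserve it. Hypothetical stand-in for the ECI."""
    consistent = sum(
        (a != b) == flip
        for a, b, flip in zip(base_answers, followup_answers, expected_flip)
    )
    return consistent / len(base_answers)

# Example: three repeated probes of one dilemma, plus two scenario pairs
# (the first follow-up is contradictory, the second is clarifying).
print(stochasticity_score(["yes", "no", "yes"]))                      # ~0.918 bits
print(consistency_index(["yes", "no"], ["no", "no"], [True, False]))  # 1.0
```

Under these assumptions, a perfectly principled model would score 0.0 entropy and 1.0 consistency; "middling consistency" corresponds to values well inside the interior of both ranges.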
Submission Number: 168