TrolleyBench: Evaluating Emergent Moral Reasoning and Consistency in LLMs

Published: 24 Sept 2025, Last Modified: 24 Sept 2025 · NeurIPS 2025 LLM Evaluation Workshop Poster · CC BY 4.0
Keywords: Ethical Consistency, Moral Reasoning, Large Language Models (LLMs), AI Ethics, Benchmarking, Evaluation Metrics, Value Alignment, AI Safety, Steerability, Reasoning Consistency, Behavioral Consistency, Deployment Risks
TL;DR: TrolleyBench is a novel benchmark for measuring moral consistency in LLMs.
Abstract: Large language models demonstrate remarkable generalization across tasks, yet their reliability in complex domains like moral reasoning remains uncertain. We introduce the first open-source benchmark explicitly designed to evaluate ethical consistency, providing a structured diagnostic of how models handle dilemmas with clarifying and contradictory follow-ups. Our framework yields quantifiable yes/no responses and introduces two novel metrics: the Ethical Consistency Index (ECI) and an entropy-based Inconsistency Score, capturing both contradictions and response variability. Applying the benchmark to state-of-the-art models—DeepSeek-R1, Mistral Small-32B, Gemini-2.5, and GPT-4.1-mini—we compare model performance against human baselines. Results show that models achieve only middling levels of ethical consistency, falling short of what reliable deployment requires. These findings highlight ethical stance as a steerable but fragile trait, raising concerns for high-stakes deployment. By situating moral reasoning within evaluation, we underscore the need for holistic benchmarks that capture emergent behaviors, and we release our benchmark to foster community progress toward more reliable and aligned LLMs (https://anonymous.4open.science/r/TrolleyBench-FD46/README.md).
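To make the two metrics concrete, below is a minimal Python sketch, assuming the Inconsistency Score is the Shannon entropy of a model's repeated yes/no answers to the same dilemma and the ECI is the fraction of dilemma/follow-up pairs answered without contradiction; the paper's exact definitions live in the released benchmark, and the function names here are illustrative only.

```python
# Hypothetical sketch of the metrics named in the abstract; the actual
# definitions used by TrolleyBench may differ.
from collections import Counter
from math import log2


def inconsistency_score(answers: list[str]) -> float:
    """Shannon entropy (in bits) of repeated yes/no answers to one dilemma.

    0.0 means the model always gives the same answer; 1.0 means a
    maximally inconsistent 50/50 split between "yes" and "no".
    """
    counts = Counter(a.strip().lower() for a in answers)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())


def ethical_consistency_index(pairs: list[tuple[str, str]]) -> float:
    """Fraction of (base answer, follow-up answer) pairs with no contradiction.

    Each pair holds the yes/no verdict on a dilemma and on a clarifying
    follow-up that should not flip that verdict.
    """
    if not pairs:
        return 1.0
    consistent = sum(
        1 for base, follow in pairs
        if base.strip().lower() == follow.strip().lower()
    )
    return consistent / len(pairs)


if __name__ == "__main__":
    print(inconsistency_score(["yes", "yes", "no", "yes"]))            # ~0.81 bits
    print(ethical_consistency_index([("yes", "yes"), ("no", "yes")]))  # 0.5
```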
Submission Number: 160