Abstract: Large language models (LLMs) have become an integral part of daily life for hundreds of millions of users. They are commonly consulted on everyday ethical scenarios, so it is crucial to ensure their alignment with human moral standards.
In this paper, we propose MORALPSYCHBENCH, a benchmark of high-quality moral score prediction tasks drawn from the psychology literature. Our experiments show that these tasks remain difficult for a wide range of recent LLMs, including LLaMA-3-70B-Instruct, Mixtral 8$\times$22B, GPT-3.5-Turbo, GPT-4o, and even o3-mini.
We then propose moral bottleneck models (MBMs), an effective and interpretable computational framework for enhancing LLMs on complex moral evaluations.
MBMs consistently improve all of the aforementioned LLMs, reducing their average mean squared error on the benchmark by 65\% (from 2.88 to 1.00, on a rating scale of $-4$ to $4$).
Further analyses indicate that MBMs can be flexibly instantiated with multiple moral theory bottlenecks and architectures.
We hope our solution and findings spur more studies toward safe and ethical LLM applications.
Paper Type: Short
Research Area: NLP Applications
Research Area Keywords: LLM, Morality, Bottleneck
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources
Languages Studied: English
Submission Number: 8356