Abstract: Large language models (LLMs) have become an integral part of daily life for hundreds of millions of users. They are commonly consulted on everyday ethical scenarios, so it is crucial to ensure their alignment with human moral standards.
In this paper, we propose MORALPSYCHBENCH, a benchmark of high-quality moral score prediction tasks drawn from the psychology literature. Our experiments show that these tasks remain difficult for a wide range of recent LLMs, including LLaMA-3-70B-Instruct, Mixtral 8$\times$22B, GPT-3.5-Turbo, GPT-4o, and even o3-mini.
We then propose moral bottleneck models (MBMs), an effective and interpretable computational framework for enhancing LLMs on complex moral evaluations.
MBMs consistently improve all of the aforementioned LLMs, reducing their average mean squared error on the benchmark by 65\% (from 2.88 to 1.00, on a rating scale of $-4$ to $4$).
Further analyses indicate that MBMs can be flexibly instantiated with multiple moral theory bottlenecks and architectures.
We hope our solution and findings spur more studies toward safe and ethical LLM applications.
Paper Type: Short
Research Area: NLP Applications
Research Area Keywords: LLM, Morality, Bottleneck
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources
Languages Studied: English
Submission Number: 8356