Keywords: morality, value, alignment, semantics
Abstract: To predict how LLMs might behave, it is crucial to understand how much they value some moral virtues over others. We operationalize a model's values as a scalar over virtue concepts denoting their relative importance, and we use several convergent measures to obtain this scalar. We then quantify how consistent this measure is across those methods. For sufficiently consistent models, we test whether an aggregate of this scalar predicts model behavior on action-selection tasks in which virtues conflict. For the models tested (Llama-3, Gemini, and GPT-4), we show that every model exhibits at least some inconsistency across our convergent measures, and that the moral representations of even the most consistent model do not map neatly onto its action choices in simple moral dilemmas.
Submission Number: 27
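The abstract does not specify how cross-method consistency is quantified. As a purely illustrative sketch, one could treat each elicitation method as producing a scalar importance score per virtue and measure agreement with pairwise Spearman rank correlations; the virtue names, method names, scores, and the choice of correlation below are assumptions for illustration, not the paper's actual procedure.

```python
# Illustrative sketch (not the paper's implementation): each elicitation method
# yields a scalar importance score per virtue; cross-method consistency is the
# mean pairwise Spearman rank correlation. All names and numbers are placeholders.
from itertools import combinations
from scipy.stats import spearmanr

VIRTUES = ["honesty", "fairness", "loyalty", "care", "liberty"]

# importance[method][i] = relative importance the model assigns to VIRTUES[i]
importance = {
    "direct_rating":   [0.9, 0.8, 0.3, 0.7, 0.4],
    "pairwise_choice": [0.8, 0.9, 0.2, 0.6, 0.5],
    "ranking_prompt":  [0.7, 0.6, 0.4, 0.9, 0.3],
}

def consistency(scores: dict[str, list[float]]) -> float:
    """Mean pairwise Spearman correlation across elicitation methods."""
    rhos = []
    for a, b in combinations(scores, 2):
        rho, _ = spearmanr(scores[a], scores[b])
        rhos.append(rho)
    return sum(rhos) / len(rhos)

print(f"cross-method consistency: {consistency(importance):.2f}")
```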