Abstract: As the impact of large language models increases, understanding the moral values they encode becomes ever more important. Assessing the moral values encoded in these models via direct prompting is challenging due to potential leakage of human norms into model training data, and due to the models' sensitivity to prompt formulation. Instead, we propose to use word associations, which have been shown to reflect moral reasoning in humans, as low-level underlying representations to obtain a more robust picture of LLMs' moral reasoning. We study moral differences between associations from Western English-speaking communities and LLMs trained predominantly on English data. First, we create a large dataset of $\textit{LLM-generated}$ word associations, resembling an existing dataset of $\textit{human}$ word associations. Next, we propose a novel method to propagate moral values, based on seed words derived from Moral Foundations Theory, through the human and LLM-generated association graphs. Finally, we compare the resulting moral representations, highlighting fine-grained but systematic differences between the moral values emerging from English speakers' and LLMs' associations.
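To make the propagation step concrete, below is a minimal sketch of how moral values might be spread from seed words through a word-association graph. The abstract does not specify the exact algorithm, so the propagation rule (clamped iterative neighbourhood averaging), the seed words, and the toy association pairs are illustrative assumptions, not the paper's method.

```python
# Minimal sketch: propagate moral polarity from seed words through a
# word-association graph. Seed lists, data, and the update rule are
# assumptions for illustration only.
from collections import defaultdict

# Toy cue -> response association pairs (hypothetical; real data would come
# from human association norms or LLM-generated associations).
associations = [
    ("care", "help"), ("help", "kindness"), ("harm", "hurt"),
    ("hurt", "pain"), ("kindness", "warmth"), ("pain", "suffering"),
]

# Hypothetical seed polarities for one moral foundation (care/harm):
# +1 for virtue-related seeds, -1 for vice-related seeds.
seeds = {"care": 1.0, "kindness": 1.0, "harm": -1.0, "hurt": -1.0}

# Build an undirected adjacency list from the association pairs.
graph = defaultdict(set)
for cue, response in associations:
    graph[cue].add(response)
    graph[response].add(cue)

# Initialise scores: seeds keep their polarity, all other words start at 0.
scores = {word: seeds.get(word, 0.0) for word in graph}

# Iterative propagation: each non-seed word takes the mean score of its
# neighbours; seed scores stay clamped so the moral signal does not drift.
for _ in range(20):
    updated = {}
    for word, neighbours in graph.items():
        if word in seeds:
            updated[word] = seeds[word]
        else:
            updated[word] = sum(scores[n] for n in neighbours) / len(neighbours)
    scores = updated

# Words closer to virtue seeds end up with positive scores, words closer to
# vice seeds with negative scores.
for word, value in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{word:>10s}: {value:+.2f}")
```

Running the same procedure on a human-derived graph and an LLM-derived graph would yield two sets of word-level moral scores that can then be compared, which is the spirit of the comparison described in the abstract.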
Paper Type: Long
Research Area: Linguistic theories, Cognitive Modeling and Psycholinguistics
Research Area Keywords: Moral alignment, LLMs, Cognitive Modeling, Word association
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 2109