Neuro-Symbolic Models of Human Moral Judgment: LLMs as Automatic Feature Extractors

Published: 23 Jun 2023, Last Modified: 10 Jul 2023
Venue: DeployableGenerativeAI
Keywords: moral cognition, moral machines, AI safety, neuro-symbolic
Abstract: As AI systems gain prominence in society, concerns about their safety become crucial to address. There have been repeated calls to align powerful AI systems with human morality. However, previous attempts have relied on black-box systems that cannot be interpreted or explained. In response, we introduce a methodology that combines the natural language processing abilities of large language models (LLMs) with the interpretability of symbolic models to form competitive neuro-symbolic models for predicting human moral judgment. Our method uses an LLM to extract morally-relevant features from a stimulus and then passes those features through a cognitive model that predicts human moral judgment. This approach achieves state-of-the-art performance on the MoralExceptQA benchmark, improving on the previous F1 score by 20 points and accuracy by 18 points, while also enhancing interpretability by exposing every feature used in the model's computation.
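A minimal sketch of the two-stage pipeline the abstract describes, under stated assumptions: the feature names, weights, and the stand-in `extract_features` function below are illustrative placeholders (the paper's actual features and cognitive model are not specified here), and a real implementation would obtain the feature values by prompting an LLM.

```python
import math


def extract_features(vignette: str) -> dict:
    """Stage 1 (stand-in for an LLM call): score morally-relevant
    features of a stimulus as interpretable name -> value pairs.
    Values here are fixed placeholders; a real pipeline would prompt
    an LLM to produce them."""
    return {
        "harm_caused": 0.2,        # hypothetical feature
        "benefit_to_others": 0.9,  # hypothetical feature
        "rule_violation": 1.0,     # hypothetical feature
    }


def cognitive_model(features: dict, weights: dict, bias: float) -> float:
    """Stage 2: an interpretable symbolic model (here, a weighted sum
    with a logistic link). Every feature's contribution to the final
    judgment is visible in `weights`."""
    z = bias + sum(weights[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))  # P(judged morally permissible)


# Illustrative weights; in practice these would be fit to human judgments.
weights = {"harm_caused": -2.0, "benefit_to_others": 1.5, "rule_violation": -1.0}
vignette = "Someone cuts in line at a deli to help a person in urgent need."
p = cognitive_model(extract_features(vignette), weights, bias=0.5)
```

Because the second stage is a transparent symbolic model, each prediction can be decomposed into per-feature contributions, which is the interpretability benefit the abstract claims over end-to-end black-box approaches.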
Submission Number: 27