Encoding Values: Injecting Morality into Machines via Prompt-Conditioned Moral Frames

Published: 27 Sept 2025, Last Modified: 09 Nov 2025
Venue: NeurIPS Creative AI Track 2025
License: CC BY 4.0
Track: Paper
Keywords: Large Language Models (LLMs), Moral Value Alignment, AI Ethics & Safety, Creative AI, Responsible AI
Abstract: Large language models (LLMs) are typically aligned to a single, universal policy, obscuring the rich plurality of human moral perspectives that drive creative practice. We present a study of prompt-level moral steering in large language models using ten-principle constitutions. Seven constitutions matched in length (Intersectional Feminist, Ecological Justice, Ubuntu, Indigenous Sovereignty, Universal Human Rights, Neutral, and a Random Adjective placebo) are paired with a 100-question benchmark spanning Everyday Advice, Policy Scenarios, Normative Dilemmas, and Rewrite Tasks. We generate 2,800 completions across four models (GPT-4o-mini, Llama-3.1-8B, Llama-4 Scout-17B, Qwen-3-235B) and evaluate them with four complementary metrics: toxicity, semantic alignment with the canon of each constitution, lexical marker ratio, and an LLM-as-a-Judge composite of authenticity, helpfulness, and safety. Prompted moral frames yield consistent improvements in alignment and judged quality without increasing toxicity; placebo prompts do not. Effects replicate across models and are strongest on morally charged tasks. The open-source toolkit lets researchers or artists author new frames in minutes, supporting participatory, culturally adaptive AI. Our results show that pluralistic prompting is a practical lever for value-conditional behavior and a fertile instrument for creative AI exploration. The code and results obtained in this study are available at: https://github.com/ArjunBalaji79/encoding_values_neurips
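The core mechanism described in the abstract, prepending a constitution to each benchmark question and scoring outputs with a surface metric such as the lexical marker ratio, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the constitution text, marker lexicon, and function names here are all hypothetical.

```python
def build_prompt(constitution: str, question: str) -> str:
    """Condition a model on a moral frame by prepending its constitution
    to the benchmark question (illustrative of the paper's setup)."""
    return f"{constitution}\n\nQuestion: {question}"


def lexical_marker_ratio(text: str, markers: set[str]) -> float:
    """Fraction of whitespace tokens in `text` that belong to a frame's
    marker lexicon; a toy stand-in for the paper's lexical metric."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(t in markers for t in tokens) / len(tokens)


# Toy Ubuntu-style markers (hypothetical, for illustration only).
markers = {"community", "reciprocity", "dignity"}
completion = "We value community dignity and mutual reciprocity"
ratio = lexical_marker_ratio(completion, markers)  # 3 of 7 tokens match
```

A real pipeline would send `build_prompt(...)` to each of the four models and combine this ratio with the toxicity, semantic-alignment, and LLM-as-a-Judge scores reported in the paper.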
Submission Number: 91