Keywords: deployable AI, AI safety, moral reasoning, large language models, trustworthiness, fairness, explainability, LLaMA, alignment, ethical evaluation
TLDR: We introduce the Shadow Index, a framework that reveals hidden moral personas in large language models using multi-decode ethical stress testing across four LLaMA models.
Abstract: As artificial intelligence systems grow increasingly sophisticated, understanding their latent moral behaviors becomes paramount for ensuring safe deployment. We introduce the Shadow Index, a novel framework for quantifying hidden moral personas in Large Language Models (LLMs) through systematic evaluation across multiple dimensions of ethical behavior. Our approach combines a comprehensive stimulus bank of ethical stress prompts, multi-decode evaluation protocols, and statistical analysis to reveal the moral landscape of AI systems. Through extensive evaluation of four state-of-the-art LLaMA models, we demonstrate significant variations in moral stability and shadow intensity, revealing that model size does not necessarily correlate with moral performance. The Shadow Index provides a standardized methodology for AI safety evaluation, offering crucial insights for the development of more ethically aligned artificial intelligence systems.
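The abstract's core idea of a multi-decode evaluation can be illustrated with a minimal sketch. All names here (`shadow_index`, the harm scores, the instability term) are hypothetical illustrations, not the paper's actual definitions: each stress prompt is decoded several times, each decode is scored by some external ethical classifier, and the per-prompt score combines average harm intensity with instability across decodes.

```python
# Hypothetical sketch of a multi-decode "shadow" score.
# The formula (mean intensity + spread across decodes) is an
# assumption for illustration, not the paper's actual metric.
from statistics import mean, pstdev

def shadow_index(decode_scores):
    """Aggregate per-decode harm scores (0 = benign, 1 = harmful)
    for one prompt into a single score: mean intensity plus
    instability (population std. dev. across decodes)."""
    return mean(decode_scores) + pstdev(decode_scores)

# Example: three stress prompts, five decodes each; the harm
# scores would come from an external classifier (values made up).
scores_per_prompt = [
    [0.1, 0.1, 0.2, 0.1, 0.1],  # stable, low intensity
    [0.0, 0.9, 0.1, 0.8, 0.0],  # unstable across decodes
    [0.7, 0.8, 0.7, 0.8, 0.7],  # stable but high intensity
]
model_shadow = mean(shadow_index(s) for s in scores_per_prompt)
```

Under this toy formulation, a model that answers identically across decodes but harmfully, and a model that oscillates between benign and harmful answers, can both receive a high score, which matches the abstract's framing of "moral stability" and "shadow intensity" as distinct dimensions.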
Submission Number: 54