Keywords: Large Language Models (LLMs), AI Safety, Moral Reasoning, Prompt Engineering, Adversarial Robustness, Jailbreaking, Alignment, Chain-of-Thought, Evaluation Metrics, Calibration, Few-Shot Learning, Refusal Behavior
Abstract: Prompt design significantly impacts the moral competence and safety alignment of large language models (LLMs), yet empirical comparisons remain fragmented across datasets and models.
We introduce ProMoral-Bench, a unified benchmark evaluating 11 prompting paradigms across four LLM families. Using ETHICS, Scruples, WildJailbreak, and our new robustness test, ETHICS-Contrast, we measure performance with our proposed Unified Moral Safety Score (UMSS), a metric that balances accuracy and safety. Our results show that compact, exemplar-guided scaffolds outperform complex multi-stage reasoning, achieving higher UMSS and greater robustness at lower token cost. While multi-stage reasoning proves fragile under perturbation, few-shot exemplars consistently enhance moral stability and jailbreak resistance. ProMoral-Bench establishes a standardized framework for principled, cost-effective prompt engineering. Code and data are available at https://anonymous.4open.science/r/ProMoral_Bench-FFB4/README.md.
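Note: the paper defines UMSS precisely; purely as a loose illustration of a metric that balances accuracy and safety, the sketch below assumes UMSS is the harmonic mean of a task-accuracy rate and a safety (correct-refusal) rate. The function name and the harmonic-mean choice are hypothetical, not the paper's definition.

    # Illustrative sketch only: assumes UMSS combines accuracy and safety
    # via a harmonic mean, which rewards prompts that score well on both
    # axes rather than trading one off against the other.
    def umss(accuracy: float, safety: float) -> float:
        """Hypothetical UMSS: harmonic mean of accuracy and safety in [0, 1]."""
        if accuracy + safety == 0.0:
            return 0.0
        return 2.0 * accuracy * safety / (accuracy + safety)

    print(umss(0.82, 0.91))  # ~0.863: high only when both components are high

A harmonic mean is one natural choice here because it is dominated by the weaker component, so a prompt cannot buy a high score with accuracy alone while failing safety checks; the actual weighting used by UMSS may differ.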
Paper Type: Long
Research Area: Safety and Alignment in LLMs
Research Area Keywords: Safety and alignment, prompting, ethical considerations in NLP applications, benchmarking, robustness, adversarial attacks/examples/training, chain-of-thought, red teaming
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Publicly available software and/or pre-trained models, Data resources
Languages Studied: English
Submission Number: 8039