Keywords: AI safety, pluralistic alignment
Abstract: Current AI safety frameworks often treat harmfulness as a binary classification problem, overlooking the nuanced nature of safety judgments in borderline cases where meaningful disagreement exists. To navigate the full spectrum of human harm judgments, safe AI systems must also distinguish between areas of consensus and areas of disagreement rooted in diverse values and perspectives. Here, we present a novel approach for systematically studying human harm judgments for AI safety: we synthesize prompts across an ordinal harm spectrum (0.0 to 1.0) and use interpretable harm and value features to curate a prompt dataset that emphasizes borderline cases and feature diversity. We applied this procedure to produce a dataset of 150 prompts, annotated by 10 human raters in a pilot study. We found that human harmfulness ratings correlate strongly with our synthetic harm levels (r = 0.54, p < 0.001), with intermediate harm levels exhibiting higher annotator disagreement. Mixed-effects regression analysis reveals that violation of freedom and fundamental rights is positively associated with perceived harmfulness at the population level. These findings provide a principled framework for grounding AI safety alignment in a systematic understanding of where human harm judgments converge and diverge.
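The two analyses summarized in the abstract (a Pearson correlation between synthetic harm levels and human harmfulness ratings, and a mixed-effects regression with per-rater grouping) can be sketched as below. This is a minimal illustrative sketch, not the authors' analysis code: the column names (`harm_level`, `rating`, `rater_id`, `freedom_violation`) and the toy data are assumptions introduced only to make the example runnable.

```python
# Minimal sketch of the abstract's two analyses on illustrative toy data.
# All column names and the simulated ratings are assumptions, not the paper's dataset.
import numpy as np
import pandas as pd
from scipy.stats import pearsonr
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

# Toy long-format data: 150 prompts x 10 raters (matching the pilot-study sizes).
n_prompts, n_raters = 150, 10
harm_level = rng.uniform(0.0, 1.0, n_prompts)       # synthetic ordinal harm spectrum (0.0 to 1.0)
freedom_violation = rng.integers(0, 2, n_prompts)    # binary value/harm feature (assumed coding)
df = pd.DataFrame({
    "prompt_id": np.repeat(np.arange(n_prompts), n_raters),
    "rater_id": np.tile(np.arange(n_raters), n_prompts),
    "harm_level": np.repeat(harm_level, n_raters),
    "freedom_violation": np.repeat(freedom_violation, n_raters),
})
# Simulated human ratings driven by harm level and the feature, plus rater noise.
df["rating"] = np.clip(
    df["harm_level"] + 0.2 * df["freedom_violation"] + rng.normal(0, 0.2, len(df)),
    0.0, 1.0,
)

# 1) Correlation between synthetic harm level and mean human rating per prompt.
per_prompt = df.groupby("prompt_id")[["harm_level", "rating"]].mean()
r, p = pearsonr(per_prompt["harm_level"], per_prompt["rating"])
print(f"Pearson r = {r:.2f}, p = {p:.3g}")

# 2) Mixed-effects regression: fixed effects for the features, random intercept per rater.
model = smf.mixedlm(
    "rating ~ harm_level + freedom_violation",
    data=df,
    groups=df["rater_id"],
).fit()
print(model.summary())
```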
Submission Number: 56