Keywords: LLM-as-a-judge, pluralistic alignment, moral foundations theory, silent failure, audit-escape, hate speech evaluation
TL;DR: 92% (12/13) of Judge LLMs that look Aligned at the median still fail in at least one category, including 5 of 6 closed models; these failures are concentrated in Politics. Alignment is a relational property of the model-context pair, not of the model alone.
Abstract: A Judge LLM serves as a scalable evaluator of human moral judgment, but its scalar score hides whether a human-LLM gap reflects different moral dimensions or miscalibrated weighting of the same ones. In humans these two aspects (which concerns are relevant and how strongly one should respond) are co-activated during socialization and acquired together; in Judge LLMs they are shaped by different training stages. We define two complementary metrics on six moral axes within each target category: Moral Orientation Fit (MOF), which measures directional similarity between human and Judge response profiles, and Vector RMSE, which measures axis-level magnitude differences. On a Measuring Hate Speech panel with 40 Judge LLMs, 50 target categories, and 522,292 observations, we show that high orientation fit together with low calibration error yields the smallest alignment gaps, and that orientation and calibration are tightly coupled in human annotators but more separable in Judge LLMs. This separability surfaces as silent failure: of the judges that look Aligned at the median, 92% (12/13) still carry at least one Orientation-gap category, including 5 of 6 closed models, with failures concentrated in the Politics meta-category. Symmetrically, more than half (53%) of the judges that look misaligned at the median still align with humans in at least one category. Alignment is therefore a relational property of the model-context pair rather than an intrinsic model attribute, with direct implications for benchmark design and audit granularity. The resulting diagnostic distinguishes differences in moral evidence from errors in response strength and supports axis-resolved auditing and context-aware model selection.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 41
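Illustrative sketch: the abstract describes two metrics computed on six moral axes, one directional (MOF) and one for magnitude (Vector RMSE). The abstract does not give their formulas, so the Python below is only a minimal sketch under stated assumptions: cosine similarity is used as a stand-in for "directional similarity", RMSE is taken over the six axis-level differences, and the function names and profile values are hypothetical, not taken from the submission.

```python
import math

def moral_orientation_fit(human, judge):
    """Assumed stand-in for MOF: cosine similarity of two six-axis profiles."""
    dot = sum(h * j for h, j in zip(human, judge))
    norm_h = math.sqrt(sum(h * h for h in human))
    norm_j = math.sqrt(sum(j * j for j in judge))
    if norm_h == 0 or norm_j == 0:
        return 0.0  # direction is undefined for an all-zero profile
    return dot / (norm_h * norm_j)

def vector_rmse(human, judge):
    """Root-mean-square of the axis-level differences between two profiles."""
    squared = [(h - j) ** 2 for h, j in zip(human, judge)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical six-axis response profiles for one target category.
human = [0.8, 0.6, 0.2, 0.1, 0.3, 0.5]
judge_scaled = [0.5 * h for h in human]          # same direction, weaker response
judge_rotated = [0.1, 0.2, 0.8, 0.6, 0.5, 0.3]   # same overall size, different axes

for name, judge in [("scaled", judge_scaled), ("rotated", judge_rotated)]:
    fit = moral_orientation_fit(human, judge)
    err = vector_rmse(human, judge)
    print(f"{name}: orientation fit = {fit:.3f}, vector RMSE = {err:.3f}")
```

Under these assumptions, the scaled judge keeps a perfect orientation fit while still carrying a nonzero magnitude error, whereas the rotated judge loses orientation fit. This is the kind of separation between the two quantities that a single scalar score would hide.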