DHP Benchmark: Measuring Discernment Ability of LLM-as-a-Judge

Published: 24 Sept 2025 · Last Modified: 24 Sept 2025 · NeurIPS 2025 LLM Evaluation Workshop Poster · License: CC BY 4.0
Keywords: Evaluation, LLM-as-a-Judge
TL;DR: We propose Discernment of Hierarchical Perturbation (DHP) benchmark, which provides quantitative discernment scores for LLM-as-a-Judge.
Abstract: Large Language Models (LLMs) are increasingly serving as evaluators in Natural Language Generation (NLG) tasks, a setup often referred to as the ``LLM-as-a-judge'' paradigm. However, the capabilities of LLMs in evaluating NLG quality remain underexplored. Current studies rely on human assessments and simple metrics that fail to capture the discernment of LLMs across diverse NLG tasks. To address this gap, we propose the Discernment of Hierarchical Perturbation (DHP) benchmarking framework, which provides quantitative discernment scores for LLMs. DHP systematically degrades reference texts at the character (typos, deletions), word (grammatical errors, entity substitutions), and sentence (reordering, factual inconsistencies) levels, then uses Wilcoxon Signed-Rank Tests to measure whether LLMs assign lower scores to perturbed texts. We benchmark 19 LLMs from 8 model families, including GPT, Llama, Qwen, Vicuna, and Mistral, across 6 datasets spanning summarization, story completion, question answering, and translation tasks. Our results provide critical insight into their strengths and limitations as NLG evaluators.
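The abstract's core statistical check, whether a judge assigns lower scores to perturbed texts than to the originals, can be sketched as a one-sided Wilcoxon signed-rank test. The snippet below is a minimal illustration, not the paper's implementation: `judge_score` is a hypothetical stand-in for an LLM-as-a-Judge scoring call, and the paper's prompts, scoring scale, and score aggregation are not reproduced here.

```python
# Minimal sketch of the paired discernment test described in the abstract.
# Assumes `judge_score` is a user-supplied function mapping a text to a
# numeric quality score produced by the LLM judge (hypothetical helper).
from scipy.stats import wilcoxon


def discernment_pvalue(judge_score, reference_texts, perturbed_texts):
    """One-sided Wilcoxon signed-rank test on paired judge scores.

    A small p-value indicates the judge systematically scores the
    perturbed texts lower than their unperturbed references.
    """
    ref_scores = [judge_score(text) for text in reference_texts]
    pert_scores = [judge_score(text) for text in perturbed_texts]
    # alternative="greater" tests whether reference scores exceed perturbed scores
    statistic, p_value = wilcoxon(ref_scores, pert_scores, alternative="greater")
    return p_value
```

In practice such a test would be run per perturbation type and per dataset, yielding the kind of per-level discernment signal the DHP framework reports.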
Submission Number: 48