Judge Reliability Harness: Stress Testing the Reliability of LLM Judges

Published: 01 Mar 2026, Last Modified: 24 Apr 2026
Venue: ICLR 2026 AIWILD
Readers: Everyone
License: CC BY 4.0
Keywords: LLM-as-a-Judge, AI Judge Reliability, Evaluation Reliability, Perturbation-Based Evaluation, Synthetic Test Generation
TL;DR: Judge Reliability Harness is an open-source tool that generates synthetic tests to evaluate AI judges. It measures consistency and sensitivity, showing that small changes in wording or formatting can significantly impact scores.
Abstract: We present the Judge Reliability Harness, an open-source library for constructing synthetic validation suites that test the reliability of AI judges (also referred to as LLM judges or autograders). As AI-judge-based scoring is widely deployed in AI benchmarks, more tooling may be needed to systematically assess judge behavior under realistic perturbations. Given a benchmark dataset and an AI judge configuration, the harness generates tests that evaluate both consistency (score stability under meaning-preserving edits) and discriminative accuracy (score changes under meaning-changing edits) for free-response and agentic task formats. In preliminary experiments across four judges and four benchmarks spanning safety, persuasion, misuse, and agentic behavior, we observe substantial variation in performance across models, tasks, and perturbation types. We do not observe a judge that is uniformly reliable across all tested settings, and superficial changes such as formatting, paraphrasing, and verbosity can induce failures. Code: https://github.com/RANDCorporation/judge-reliability-harness
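The two properties named in the abstract can be illustrated with a minimal sketch. This is not the harness's API: the function names `consistency_rate` and `discrimination_rate`, the `toy_length_judge` stand-in, and the example pairs are all hypothetical, chosen only to show how score stability under meaning-preserving edits and score change under meaning-changing edits might be tallied.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical judge interface: maps a response string to an integer score.
Judge = Callable[[str], int]
Pair = Tuple[str, str]  # (original response, perturbed response)


def consistency_rate(judge: Judge, pairs: Sequence[Pair]) -> float:
    """Fraction of meaning-preserving pairs that receive identical scores."""
    if not pairs:
        raise ValueError("pairs must be non-empty")
    return sum(judge(a) == judge(b) for a, b in pairs) / len(pairs)


def discrimination_rate(judge: Judge, pairs: Sequence[Pair]) -> float:
    """Fraction of meaning-changing pairs whose score actually changes."""
    if not pairs:
        raise ValueError("pairs must be non-empty")
    return sum(judge(a) != judge(b) for a, b in pairs) / len(pairs)


def toy_length_judge(response: str) -> int:
    """Deliberately flawed stand-in judge: scores 1 for any response
    longer than 40 characters, regardless of what it says."""
    return 1 if len(response) > 40 else 0


# Meaning-preserving edits (verbosity, light rewording).
PRESERVING = [
    ("I cannot help with that request.",
     "I cannot help with that request. To be clear, I am declining because it is unsafe."),
    ("Sure, here is a detailed step-by-step plan for the task.",
     "Sure, here's a detailed, step-by-step plan for the task."),
]

# Meaning-changing edits (the stance of the response flips).
CHANGING = [
    ("I cannot help with that request.",
     "I can help with that request."),
    ("No.",
     "Yes, absolutely, and here is everything you asked for in full."),
]

if __name__ == "__main__":
    print("consistency:", consistency_rate(toy_length_judge, PRESERVING))
    print("discrimination:", discrimination_rate(toy_length_judge, CHANGING))
```

The toy judge keys on length alone, so a verbosity edit can flip its score while a genuine change in stance can go unnoticed. In practice the judge would be an LLM call and the pairs would come from generated perturbations of benchmark items.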
Submission Number: 93