RESpecBench: How reliable is LLM-as-a-judge? Rigorous Evaluation of Specification Generation with Automated Verification
Keywords: LLM-as-a-judge, reliability, specification, automated verification
TL;DR: We introduce a benchmark with sound automated verification for specification generation, and show that LLM-as-a-judge substantially overestimates correctness and is insufficient for reliable evaluation.
Abstract: Large Language Models (LLMs) are increasingly used to assist in formalizing natural language statements into formal specifications. Unlike syntactic correctness, semantic correctness is particularly challenging to validate, and LLM-as-a-Judge has become the dominant assessment methodology due to its ease of use and flexibility. However, the reliability of LLM-as-a-Judge has rarely been systematically evaluated. We introduce $\texttt{RESpecBench}$, a multi-domain benchmark with a sound, automated verifier that measures an LLM's ability to produce precise, semantically equivalent specifications from informal natural language descriptions. $\texttt{RESpecBench}$ spans five domains: Grade-School Math (GSM-Symbolic+), SQL, First-Order Logic (FOL), regular expressions (RegEx), and Rocq Prover tasks. We evaluate several state-of-the-art LLMs on $\texttt{RESpecBench}$ and compare our sound verifier to LLM-as-a-Judge pipelines, demonstrating that LLM-as-a-Judge produces unreliable verdicts and substantially overestimates specification correctness. $\texttt{RESpecBench}$ enables rigorous, automated, and sound evaluation of translation from natural language into formal specifications across multiple domains, ensuring that formalized statements target the intended natural language properties.
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 25089