RESpecBench: How reliable is LLM-as-a-judge? Rigorous Evaluation of Specification Generation with Automated Verification
Keywords: LLM-as-a-judge, reliability, specification, automated verification
TL;DR: We introduce a benchmark with sound automated verification for specification generation, and show that LLM-as-a-judge substantially overestimates correctness and is insufficient for reliable evaluation.
Abstract: Large Language Models (LLMs) are increasingly used to assist in formalizing natural language statements into formal specifications. Unlike syntactic correctness, semantic correctness is particularly challenging to validate, and LLM-as-a-Judge has become the dominant assessment methodology due to its ease of use and flexibility. However, the reliability of LLM-as-a-Judge has rarely been systematically evaluated. We introduce $\texttt{RESpecBench}$, a multi-domain benchmark with a sound, automated verifier that measures an LLM's ability to produce precise, semantically equivalent specifications from informal natural language descriptions. $\texttt{RESpecBench}$ spans five domains: Grade-School Math (GSM-Symbolic+), SQL, First-Order Logic (FOL), regular expressions (RegEx), and Rocq Prover tasks. We evaluate several state-of-the-art LLMs on $\texttt{RESpecBench}$ and compare our sound verifier to LLM-as-a-Judge pipelines, demonstrating that LLM-as-a-Judge produces unreliable verdicts and substantially overestimates specification correctness. $\texttt{RESpecBench}$ enables rigorous, automated, and sound evaluation of translation from natural language into formal specifications across multiple domains, ensuring that formalized statements target the intended natural language properties.
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 25089