CertJudge: Evaluating Lean Formal-Code With Falsifiable Properties

Ethan S Hersch; Brando Miranda; Elyas Obbad; Srivatsava Daruru; Zhanke Zhou; Kirill Acharya; Sanmi Koyejo

Published: 17 Jun 2026, Last Modified: 21 Jun 2026

Venue: ICML 2026 AI4Math Workshop Poster

License: CC BY 4.0

Keywords: LLM-as-a-Judge, Code Evaluation, Human-Centered Coding Agents, Formal Methods, Benchmarking and Evaluation

TL;DR: We propose a property-certification protocol for LLM code judges, showing that a variance-aware Trust Index predicts human alignment and supports scalable evaluation of AI-generated Lean 4 specifications.

Abstract: LLM judges are increasingly used to evaluate AI-generated code, tests, theorems, and formal specifications. In our setting, a judge scores the quality of a candidate Lean 4 artifact relative to a gold reference, but validating each new judge, prompt, model, or threshold with fresh human ratings is expensive and does not scale. We ask whether this validation can be amortized through behavioral checks that are automatic and falsifiable. We propose CertJudge, a property-certification protocol for LLM judges of Lean 4 theorem and specification quality. Instead of treating human agreement as the only validation signal, CertJudge tests whether a judge behaves correctly under controlled perturbations that probe four properties: identity, bug monotonicity, specification monotonicity, and stability. These diagnostics are combined into a variance-aware trust index, TI_var, which provides a summary of judge reliability rather than a universal evaluator constant. On a human-labeled validation suite, TI_var strongly predicts judge-level human alignment, with validation-set Spearman correlations of 0.833 (avg labels), 0.905 (pass1), and 0.738 (pass2). On VeriBench, we use the calibrated judge to rank theorem-generation methods, illustrating how a small human calibration set can support scalable judge selection. These results are validation-set calibration evidence, not a claim of universal reliability or deployment-time generalization.
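The four behavioral checks named in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the abstract does not define the perturbations, the pass criteria, or the formula for TI_var. The judge here is a toy token-overlap scorer standing in for an LLM, `run_checks` encodes one plausible reading of each property, `toy_trust_index` is a hypothetical variance-penalized mean and not the paper's TI_var, and `spearman` is the standard rank-correlation formula for inputs without ties.

```python
from statistics import mean, pstdev

# Toy Lean 4 snippets, treated as plain strings (hypothetical examples).
GOLD = "theorem add_comm (a b : Nat) : a + b = b + a"
BUGGY = "theorem add_comm (a b : Nat) : a + b = b - a"  # injected bug
WEAK = "theorem add_comm (a b : Nat) : True"            # weakened spec


def toy_judge(candidate: str, gold: str) -> float:
    """Stand-in for an LLM judge: Jaccard overlap of whitespace tokens."""
    c, g = set(candidate.split()), set(gold.split())
    return len(c & g) / len(c | g)


def run_checks(judge, gold, buggy, weak, repeats=5, tol=1e-9):
    """One assumed reading of the four properties; returns pass/fail flags."""
    top = judge(gold, gold)
    repeated = [judge(buggy, gold) for _ in range(repeats)]
    return {
        "identity": top == 1.0,                    # gold vs. itself scores max
        "bug_monotonicity": judge(buggy, gold) < top,
        "spec_monotonicity": judge(weak, gold) < top,
        "stability": pstdev(repeated) <= tol,      # repeated calls agree
    }


def toy_trust_index(pass_rates, lam=1.0):
    """Hypothetical aggregate: mean pass rate minus a spread penalty.
    NOT the paper's TI_var, whose definition the abstract does not give."""
    return mean(pass_rates) - lam * pstdev(pass_rates)


def spearman(xs, ys):
    """Spearman rank correlation; assumes no tied values in xs or ys."""
    def ranks(vs):
        order = sorted(range(len(vs)), key=lambda i: vs[i])
        r = [0] * len(vs)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n * n - 1))


results = run_checks(toy_judge, GOLD, BUGGY, WEAK)
index = toy_trust_index([float(v) for v in results.values()])
```

In this sketch a judge-level index would be computed once per judge, and `spearman` would then compare those indices against per-judge human-alignment scores, which is the kind of correlation the abstract reports.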

Submission Number: 109
