Causal Judge Evaluation (CJE): Auditable, Unbiased Off-Policy Metrics for LLM Systems via Design-by-Projection

19 Sept 2025 (modified: 22 Oct 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: off-policy evaluation, large language models, teacher forcing, isotonic calibration, importance sampling, doubly robust estimation, causal inference, evaluation methodology, preference models, RLHF/DPO, projection onto convex sets, diagnostics, effective sample size
TL;DR: A turn-key, auditable OPE system for LLMs that calibrates judges and weights via projection, boosting ESS from near-zero to ~80–95% and delivering tight, honest CIs with stacked DR estimators.
Abstract: **Causal Judge Evaluation (CJE)** recasts offline “LLM-as-judge” evaluation as calibrated off-policy estimation. For a candidate policy \( \pi' \), the target is \( V(\pi') = E[Y(\pi')] \). From logs \( (X,A,S) \) under \( \pi_0 \) we compute sequence-level ratios \( W_{\pi'} = \pi'(A\mid X)/\pi_0(A\mid X) \) via teacher forcing. CJE applies one rule, **Design-by-Projection (DbP)**, which encodes justified knowledge as a closed convex set and projects a valid object onto it: (i) calibrate rewards \( R = f(S) \) by a mean-preserving monotone map; (ii) stabilize ratios via a unit-mean, \( S \)-monotone projection (**SIMCal-W**); (iii) use nuisance-orthogonal estimating equations (OC-IPS / DR-CPO / TMLE-style targeting); and (iv) minimize variance by convex stacking in influence-function space. A **Knowledge–Riesz** result shows that intersecting the admissible IF class with justified convex knowledge preserves the estimand and weakly lowers the attainable variance; with cross-fitting, the resulting estimators attain the **surrogate information bound**. On Arena-derived logs (n = 4,989; five policies), SIMCal-W turns near-degenerate weights into stable ones (ESS: \(0.6\% \to 94.6\%\)), calibrated DR recovers near-\( \sqrt{n} \) scaling with well-calibrated CIs, and IF-stacking further improves ordering. CJE reports uncertainty that includes an oracle-slice jackknife and surfaces overlap/tail and judge-reliability diagnostics; when calibration support is limited, it returns rank-robust conclusions (REFUSE-LEVEL) rather than overconfident levels.
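As a rough illustration of steps (i) and (ii) and of the ESS diagnostic, the sketch below computes sequence-level log-ratios from per-token log-probabilities, rescales the weights to mean one, and projects them onto nondecreasing functions of the judge score using pool-adjacent-violators. It is a simplified stand-in and not the authors' SIMCal-W: the function names, the choice of a nondecreasing direction, and the toy numbers are all assumptions made for this example.

```python
import math

def sequence_log_ratio(logp_target, logp_logging):
    """log W for one logged response: summed per-token log-prob difference."""
    return sum(logp_target) - sum(logp_logging)

def mean_one_weights(log_ratios):
    """Exponentiate stably (subtract the max) and rescale to sample mean 1."""
    m = max(log_ratios)
    raw = [math.exp(lr - m) for lr in log_ratios]
    mean = sum(raw) / len(raw)
    return [r / mean for r in raw]

def ess_fraction(weights):
    """Kish effective sample size as a fraction of n: (sum w)^2 / (n * sum w^2)."""
    n = len(weights)
    s = sum(weights)
    s2 = sum(w * w for w in weights)
    return s * s / (n * s2)

def pava(values):
    """Pool-adjacent-violators: least-squares nondecreasing fit to a sequence."""
    blocks = []  # each block is [sum, count]
    for v in values:
        blocks.append([v, 1])
        # Merge backwards while the previous block mean exceeds the last one.
        while (len(blocks) > 1 and
               blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)
    return fitted

def monotone_weights(scores, weights):
    """Project weights onto nondecreasing functions of the judge score.

    Assumption: nondecreasing direction; tied scores are not pooled here.
    Block averaging keeps the sample mean unchanged, so mean-one weights
    stay mean-one, and the sum of squares cannot increase.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    fitted = pava([weights[i] for i in order])
    out = [0.0] * len(weights)
    for rank, i in enumerate(order):
        out[i] = fitted[rank]
    return out

# Toy data (made up for illustration): judge scores and mean-one raw weights.
scores = [0.1, 0.2, 0.3, 0.4]
raw = [0.1, 3.0, 0.2, 0.7]
stable = monotone_weights(scores, raw)
ess_before = ess_fraction(raw)
ess_after = ess_fraction(stable)
```

The same `pava` helper could serve as a basic monotone reward calibrator for \( R = f(S) \) when applied to oracle labels sorted by judge score, since block averaging also preserves the mean of the labels.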
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 21553