Keywords: Human-aligned evaluation, single-pass evaluation pipeline, supervised calibration, LLM-RUBRIC, FELIX, SummEval
TL;DR: SAJA is a single-call LLM-as-a-judge: one rubric prompt yields multi-dimensional scores, then a lightweight calibrator (optionally with conformal intervals) aligns the outputs to human judgments.
Abstract: LLM-as-a-Judge systems are increasingly used to evaluate text at scale, yet production deployment demands low latency, minimal cost, and compatibility with closed-source APIs. Current approaches fall short in different ways: some require many LLM calls and per-dataset prompt tuning, others depend on logit access unavailable in commercial APIs, and still others demand multiple rounds of LLM interaction for iterative feature discovery. We present **SAJA** (**S**imple **A**pproach to **J**udge **A**lignment), built on the principle that task-specific alignment should reside in a lightweight calibration head, not in elaborate prompts or model internals. SAJA makes exactly one LLM call per item using a fixed structured rubric prompt, extracts a multi-dimensional feature vector, and maps it to a human-aligned score via a calibration head trained on a small number of human labels. No iterative prompt search, no logit access, and no multi-round LLM interaction are needed. Yet SAJA matches far more complex systems across four evaluation paradigms: 86% F1 on MT-Bench pairwise preference (vs. 78% uncalibrated), competitive performance on five classification benchmarks with a single call, and +5.71% F1 over prompt-optimized baselines on proprietary data. Ablations confirm that multi-dimensional rubric features outperform one-dimensional calibration (SummEval $\rho$ improves from $0.60$ to $0.74$) and that coarse rubric outputs recover the same human alignment as full logit distributions ($\rho = 0.36$ vs. $0.37$), establishing that logit access is unnecessary for calibrated judge alignment. Moreover, SAJA is model-agnostic: a 9B open-source model with SAJA ($\rho = 0.70$) surpasses raw GPT-4.1 ($\rho = 0.60$). Its single-call design yields up to 4.8$\times$ cost savings over per-question approaches.
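For concreteness, the sketch below illustrates the single-call pipeline the abstract describes; it is not the paper's implementation. The rubric dimensions (SummEval's four), the OpenAI-style chat API, and the ridge-regression calibration head are all illustrative assumptions, and the paper's conformal-interval option is omitted.

```python
# Minimal sketch of a SAJA-style pipeline, under stated assumptions:
# one fixed rubric prompt per item, a parsed multi-dimensional feature
# vector, and a lightweight calibration head fit on a few human labels.
import json
import numpy as np
from sklearn.linear_model import Ridge
from openai import OpenAI

client = OpenAI()

# Hypothetical rubric; the paper's actual prompt and dimensions may differ.
RUBRIC_PROMPT = """Rate the following text on each dimension from 1 to 5.
Dimensions: coherence, consistency, fluency, relevance.
Return JSON: {{"coherence": _, "consistency": _, "fluency": _, "relevance": _}}

Text:
{text}"""

DIMS = ("coherence", "consistency", "fluency", "relevance")

def rubric_features(text: str) -> np.ndarray:
    """Exactly one LLM call per item: the rubric prompt yields a feature vector."""
    resp = client.chat.completions.create(
        model="gpt-4.1",  # any chat model works; SAJA is model-agnostic
        messages=[{"role": "user", "content": RUBRIC_PROMPT.format(text=text)}],
        response_format={"type": "json_object"},
    )
    scores = json.loads(resp.choices[0].message.content)
    return np.array([scores[d] for d in DIMS], dtype=float)

def fit_calibrator(texts: list[str], human_scores: list[float]) -> Ridge:
    """Calibration head: maps rubric features to human-aligned scores,
    trained on a small number of human labels."""
    X = np.stack([rubric_features(t) for t in texts])
    return Ridge(alpha=1.0).fit(X, human_scores)

def saja_score(head: Ridge, text: str) -> float:
    return float(head.predict(rubric_features(text)[None, :])[0])
```

The design point this illustrates is that all task-specific alignment lives in `fit_calibrator`: the prompt stays fixed across datasets, only coarse rubric scores (no logits) are consumed, and retargeting the judge means refitting the small head, not searching over prompts.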
Submission Type: Emerging
Copyright Form: pdf
Submission Number: 141