Quantitative LLM Judges

TMLR Paper9216 Authors

26 May 2026 (modified: 29 May 2026) | Under review for TMLR | CC BY 4.0
Abstract: LLM-as-a-judge is a framework in which a large language model (LLM) evaluates the output of another LLM. While LLMs excel at producing qualitative textual evaluations, they often struggle to predict human preferences and numeric scores. We propose quantitative LLM judges, which use regression models to align the evaluation scores of LLM judges with human scores in a given domain. These models are trained to improve on the original judge's score, using the judge's rationale and score as inputs. We present four quantitative judges for different types of absolute and relative feedback, showcasing the generality and versatility of our framework. Our framework applies to proprietary models and remains usable when human feedback is limited, as is expected in practice. We validate our claims empirically on four datasets. Our experiments show that quantitative judges can improve the predictive power of existing judges through post-hoc modeling.
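The post-hoc idea in the abstract (a regression model that maps the judge's rationale and score to a human-aligned score) can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not specify the regression model, how the rationale is turned into features, or the four judge variants. The sketch assumes ridge regression, treats `rationale_features` as a hypothetical stand-in for an embedding of the judge's rationale, and uses synthetic data throughout. All function names are illustrative.

```python
import numpy as np


def build_design_matrix(rationale_features, judge_scores):
    # Columns: intercept, the original judge score, then the rationale features.
    ones = np.ones(len(judge_scores))
    return np.column_stack([ones, judge_scores, rationale_features])


def fit_posthoc_regressor(rationale_features, judge_scores, human_scores, l2=1.0):
    # Closed-form ridge regression (an assumed model choice, not the paper's).
    X = build_design_matrix(rationale_features, judge_scores)
    penalty = l2 * np.eye(X.shape[1])
    penalty[0, 0] = 0.0  # leave the intercept unpenalized
    return np.linalg.solve(X.T @ X + penalty, X.T @ human_scores)


def predict_human_score(weights, rationale_features, judge_scores):
    return build_design_matrix(rationale_features, judge_scores) @ weights


def demo(seed=0, n=200, dim=8, n_train=150):
    # Synthetic data only: a judge whose score is a biased, rescaled
    # version of the human score, with the bias visible in the features.
    rng = np.random.default_rng(seed)
    human = rng.uniform(1.0, 5.0, size=n)
    rationale_features = rng.normal(size=(n, dim))  # hypothetical embedding
    judge = 0.5 * human + 1.0 + rationale_features[:, 0] + rng.normal(scale=0.1, size=n)

    train, test = slice(0, n_train), slice(n_train, n)
    weights = fit_posthoc_regressor(rationale_features[train], judge[train], human[train])
    calibrated = predict_human_score(weights, rationale_features[test], judge[test])

    raw_mse = float(np.mean((judge[test] - human[test]) ** 2))
    calibrated_mse = float(np.mean((calibrated - human[test]) ** 2))
    return raw_mse, calibrated_mse


if __name__ == "__main__":
    raw_mse, calibrated_mse = demo()
    print(f"raw judge MSE: {raw_mse:.3f}, post-hoc MSE: {calibrated_mse:.3f}")
```

Because the regressor only consumes the judge's outputs, a sketch like this needs no access to the judge's weights, which is consistent with the abstract's claim that the framework can be applied to proprietary models.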
Submission Type: Regular submission (no more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=5oatWiWCQ3
Changes Since Last Submission: The previous submission was desk rejected for using the wrong font. This was caused by \usepackage{times}, which we have commented out.
Assigned Action Editor: ~Xuanjing_Huang1
Submission Number: 9216