Keywords: LLM-as-a-judge, LLM evaluation, human feedback, alignment, classic machine learning models
TL;DR: We propose quantitative LLM judges, which align evaluation scores of existing LLM judges to human scores in a given domain using classic machine learning models.
Abstract: LLM-as-a-judge is a framework in which a large language model (LLM) evaluates the output of another LLM. While LLMs excel at producing qualitative textual evaluations, they often struggle to predict human preferences and numeric scores. We propose quantitative LLM judges, which align the evaluation scores of existing LLM judges to human scores in a given domain using regression models. The models are trained to improve the score of the original judge using the judge's rationale and score. We present four quantitative judges for different types of absolute and relative feedback, showcasing the generality and versatility of our framework. Our framework is more computationally efficient than supervised fine-tuning and can be more statistically efficient when human feedback is limited, as is expected in practice. We validate these claims empirically on four datasets using two base judges. Our experiments show that quantitative judges can improve the predictive power of existing judges through post-hoc modeling.
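The sketch below illustrates the core idea described in the abstract: post-hoc alignment of a base judge's output to human scores with a lightweight regression model trained on the judge's rationale and score. It is not the authors' implementation; the featurization (TF-IDF of the rationale), the Ridge regressor, and the toy data are illustrative assumptions.

```python
import numpy as np
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge

# Toy training data: base-judge rationales, base-judge numeric scores,
# and the human scores to align to (all fabricated for illustration).
rationales = [
    "The answer is correct and well explained.",
    "The answer misses key details and is partly wrong.",
    "Clear, complete, and faithful to the source.",
    "Mostly irrelevant to the question.",
]
judge_scores = np.array([9.0, 6.0, 8.5, 5.0])
human_scores = np.array([8.0, 3.0, 9.0, 2.0])

# Featurize the rationale text and append the judge's own score as a feature.
vectorizer = TfidfVectorizer()
X_text = vectorizer.fit_transform(rationales)
X = hstack([X_text, csr_matrix(judge_scores.reshape(-1, 1))])

# A lightweight regression "quantitative judge" aligning the base judge to humans.
model = Ridge(alpha=1.0).fit(X, human_scores)

# At evaluation time, refine a new base-judge output into a human-aligned score.
new_rationale = ["Accurate but somewhat terse explanation."]
new_judge_score = np.array([[7.5]])
X_new = hstack([vectorizer.transform(new_rationale), csr_matrix(new_judge_score)])
print(model.predict(X_new))
```

Because only the small regression head is trained, this kind of post-hoc modeling avoids fine-tuning the underlying LLM judge, which is the source of the computational-efficiency claim in the abstract.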
Primary Area: foundation or frontier models, including LLMs
Submission Number: 15693