From Calculation to Adjudication: Examining LLM judges on Mathematical Reasoning Tasks

ACL ARR 2024 August Submission 276 Authors

15 Aug 2024 (modified: 04 Sept 2024) · ACL ARR 2024 August Submission · CC BY 4.0
Abstract: To reduce the need for human annotations, large language models (LLMs) have been proposed as judges of the quality of other candidate models. LLM judges are typically evaluated by measuring their correlation with human judgments on generation tasks such as summarization or machine translation. In contrast, we study LLM judges on mathematical reasoning tasks. These tasks require multi-step reasoning, and the correctness of their solutions is verifiable, enabling a more objective evaluation. We perform a detailed performance analysis and find that the judges used are mostly unable to improve task performance but are able to pick the better model. Our analysis uncovers a strong correlation between judgment performance and the candidate models' task performance. We observe that judges tend to choose the higher-quality model even when its answer is incorrect. Further, we show that it is possible to use statistics, such as the task performances of the individual models, to predict judgment performance. In an ablation, we systematically perturb the candidate answers and observe that judges often still prefer the original answer, providing evidence that judges incorporate writing style into their judgments. In summary, we find that regularities in the judgments are quantifiable using statistical measures, and we provide various angles on exploiting them.
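The correlation between judgment performance and candidate task performance described in the abstract can, in principle, be quantified with a simple statistic. The following is a minimal, hypothetical sketch, not the paper's actual code, data, or analysis: it computes a Pearson correlation between made-up per-candidate task accuracies and judge accuracies, purely to illustrate the kind of measurement the abstract refers to.

```python
# Illustrative sketch only: all names and numbers are hypothetical placeholders,
# not results or code from the paper.
from statistics import mean
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical per-candidate statistics:
# task_accuracy[i]  -- fraction of math problems candidate model i solves correctly
# judge_accuracy[i] -- fraction of judgments involving candidate i in which the
#                      judge selects the actually correct answer
task_accuracy  = [0.42, 0.55, 0.63, 0.71, 0.80]
judge_accuracy = [0.48, 0.57, 0.66, 0.74, 0.79]

print(f"correlation = {pearson(task_accuracy, judge_accuracy):.3f}")
```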
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: analysis, data shortcuts/artifacts
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 276