Abstract: As LLMs advance, reliably evaluating their generated text becomes more challenging due to the high costs of human evaluation. To make progress toward better LLM autoraters, we introduce FLAME, a family of Foundational Large Autorater ModEls. FLAME is trained on our large and diverse collection of nearly 100 quality assessment tasks comprising 5M+ human judgments, curated and standardized using publicly released human evaluations from previous research. FLAME significantly improves generalization to a wide variety of held-out tasks, outperforming proprietary LLMs such as GPT-4 and Claude on many of them. Additionally, we show that our FLAME multitask mixture can be further optimized for specific downstream applications, e.g., reward modeling evaluation, through a novel tail-patch fine-tuning technique. Notably, on RewardBench, our model (86.7) is the top-performing generative model trained solely on permissively licensed data, outperforming both GPT-4-0125 (85.9) and GPT-4o (84.7). Our analysis reveals that FLAME is significantly less biased than popular LLM-as-a-Judge models on the CoBBLEr cognitive bias benchmark, while effectively identifying high-quality responses for code generation. We release our FLAME data collection at this http URL.
Paper Type: Long
Research Area: Generation
Research Area Keywords: pre-training, prompting, applications, robustness, fine-tuning, multi-task learning, human evaluation, automatic evaluation
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources
Languages Studied: English
Submission Number: 1833