Abstract: As LLMs advance, reliably evaluating their generated text becomes more challenging due to the high costs of human evaluation. To make progress toward better LLM autoraters, we introduce FLAME, a family of Foundational Large Autorater ModEls. FLAME is trained on our large and diverse collection of nearly 100 quality assessment tasks comprising 5M+ human judgments, curated and standardized using publicly released human evaluations from previous research. FLAME significantly improves generalization to a wide variety of held-out tasks, outperforming proprietary LLMs such as GPT-4 and Claude on many of them. Additionally, we show that our FLAME multitask mixture can be further optimized for specific downstream applications, e.g., reward modeling evaluation, through a novel tail-patch fine-tuning technique. Notably, on RewardBench, our model (86.7) is the top-performing generative model trained solely on permissively licensed data, outperforming both GPT-4-0125 (85.9) and GPT-4o (84.7). Our analysis reveals that FLAME is significantly less biased than popular LLM-as-a-Judge models on the CoBBLEr cognitive bias benchmark, while effectively identifying high-quality responses for code generation. We release our FLAME data collection at this http URL.
Paper Type: Long
Research Area: Generation
Research Area Keywords: pre-training, prompting, applications, robustness, fine-tuning, multi-task learning, human evaluation, automatic evaluation
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources
Languages Studied: English
Submission Number: 1833