FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets

Seonghyeon Ye; Doyoung Kim; Sungdong Kim; Hyeonbin Hwang; Seungone Kim; Yongrae Jo; James Thorne; Juho Kim; Minjoon Seo

FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets

Seonghyeon Ye, Doyoung Kim, Sungdong Kim, Hyeonbin Hwang, Seungone Kim, Yongrae Jo, James Thorne, Juho Kim, Minjoon Seo

Published: 28 Oct 2023, Last Modified: 26 Nov 2023Instruction Workshop @ NeurIPS 2023EveryoneRevisionsBibTeX

Keywords: large language models, language model evaluation, natural language processing

TL;DR: We introduce fine-grained language model evaluation based on alignment skill sets to measure the performance of various LLMs.

Abstract: Evaluation of Large Language Models (LLMs) is challenging because instruction-following necessitates alignment with human values and the required set of skills varies depending on the instruction. However, previous studies have mainly focused on coarse-grained evaluation (i.e. overall preference-based evaluation), which limits interpretability since it does not consider the nature of user instructions that require instance-wise skill composition. In this paper, we introduce FLASK (Fine-grained Language Model Evaluation based on Alignment Skill Sets), a fine-grained evaluation protocol for both human-based and model-based evaluation which decomposes coarse-level scoring to a skill set-level scoring for each instruction. We experimentally observe that the fine-graininess of evaluation is crucial for attaining a holistic view of model performance and increasing the reliability of the evaluation. Using FLASK, we compare multiple open-source and proprietary LLMs and observe a high correlation between model-based and human-based evaluations.

Submission Number: 52

Loading