Toward the Evaluation of Large Language Models Considering Score Variance across Instruction Templates

Yusuke Sakai; Adam Nohejl; JIANGNAN HANG; Hidetaka Kamigaito; Taro Watanabe

Published: 21 Sept 2024, Last Modified: 06 Oct 2024, Venue: BlackboxNLP 2024, License: CC BY 4.0

Track: Full paper

Keywords: Large Language Model Evaluation, Instruction Template, Robustness Evaluation

TL;DR: Current NLU evaluations overlook prompt variance, causing unfair LLM comparisons. We propose using multiple templates and the Sharpe score to ensure fairer evaluation.

Abstract: The natural language understanding (NLU) performance of large language models (LLMs) has been evaluated across various tasks and datasets. The existing evaluation methods, however, do not take into account the variance in scores due to differences in prompts, which leads to unfair evaluation and comparison of NLU performance. Moreover, evaluation designed for specific prompts is inappropriate for instruction tuning, which aims to perform well with any prompt. It is therefore necessary to find a way to measure NLU performance in a fair manner, considering score variance between different instruction templates. In this study, we provide English and Japanese cross-lingual datasets for evaluating the NLU performance of LLMs, which include multiple instruction templates for fair evaluation of each task, along with regular expressions to constrain the output format. Furthermore, we propose the Sharpe score as an evaluation metric that takes into account the variance in scores between templates. Comprehensive analysis of English and Japanese LLMs reveals that the high variance among templates has a significant impact on the fair evaluation of LLMs.
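To make the idea of a variance-aware metric concrete, here is a minimal sketch of how scores from several instruction templates could be combined into one number. The abstract does not give the formula for the Sharpe score, so the form used here (mean divided by `alpha` times the standard deviation plus one) is an assumption in the spirit of the financial Sharpe ratio. The function name `variance_penalized_score` and the parameter `alpha` are hypothetical and chosen only for illustration.

```python
from statistics import mean, pstdev


def variance_penalized_score(template_scores, alpha=1.0):
    """Combine per-template scores into one variance-aware score.

    ASSUMPTION: the form mean / (alpha * std + 1) is illustrative only;
    it is not taken from this page and may differ from the paper's
    definition of the Sharpe score.
    """
    if not template_scores:
        raise ValueError("need at least one template score")
    mu = mean(template_scores)       # average score across templates
    sigma = pstdev(template_scores)  # population std. dev. across templates
    # A larger spread across templates lowers the score; alpha sets
    # how strongly the spread is penalized (alpha = 0 gives the mean).
    return mu / (alpha * sigma + 1.0)


# Two hypothetical models with the same mean accuracy over three templates.
stable = [0.80, 0.80, 0.80]
unstable = [0.60, 0.80, 1.00]

stable_score = variance_penalized_score(stable)
unstable_score = variance_penalized_score(unstable)
print(stable_score > unstable_score)
```

Under this assumed form, the two models tie on mean accuracy, but the one whose score swings between templates is ranked lower, which is the kind of distinction a single-prompt evaluation cannot make.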


Submission Number: 84
