Quadruple-Slit Experiment: Reliability Issues in Multiple-Choice Evaluation for Language Models

Anonymous

16 Dec 2023 · ACL ARR 2023 December Blind Submission · Readers: Everyone
TL;DR: We uncover three intrinsic issues in current multiple-choice evaluation methods that undermine their reliability.
Abstract: Multiple-choice evaluation is commonly used to assess language model capabilities. Current evaluation methods primarily employ a probability comparison approach. However, our study demonstrates overlooked reliability issues with this approach. Its deterministic prediction comes at the cost of sacrificing core properties of multiple-choice questions: order invariance, position independence, and length independence. To check reliability, we propose a consistency checking method inspired by the double-slit experiment. Experiments across multiple LLMs and benchmarks reveal the shaky reliability of current implementations, uncovering severe position and length biases unintentionally introduced by these evaluation methods.
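
To illustrate one of the properties the abstract names, the sketch below shows how order invariance could be checked in principle: score the same question under every permutation of its answer options and verify that the selected option text never changes. This is only a minimal illustration, not the paper's method; option_logprobs is a hypothetical placeholder for whatever probability-comparison scorer a benchmark uses.

    # Minimal sketch of an order-invariance consistency check for
    # multiple-choice evaluation (illustrative only, not the paper's method).
    from itertools import permutations

    def option_logprobs(question: str, options: list[str]) -> list[float]:
        """Hypothetical placeholder: return one log-probability per presented
        option from the language model's probability-comparison scorer."""
        raise NotImplementedError

    def is_order_invariant(question: str, options: list[str]) -> bool:
        """True if the chosen option text is identical under every ordering
        of the options, i.e. the prediction depends on content rather than
        on the position at which an option appears."""
        chosen = set()
        for perm in permutations(options):
            scores = option_logprobs(question, list(perm))
            best_idx = max(range(len(perm)), key=scores.__getitem__)
            chosen.add(perm[best_idx])
        return len(chosen) == 1

Analogous checks can be written for position independence (fixing the correct answer at each slot) and length independence (padding distractors), which is where the position and length biases mentioned above would surface.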
Paper Type: long
Research Area: Resources and Evaluation
Languages Studied: English