MCQFormatBench: Robustness Tests for Multiple-Choice Questions

ACL ARR 2024 June Submission 5908 Authors

16 Jun 2024 (modified: 02 Jul 2024) · ACL ARR 2024 June Submission · CC BY 4.0
Abstract: Multiple-choice questions (MCQs) are often used to evaluate large language models (LLMs). They measure LLMs' general commonsense and reasoning abilities, as well as their knowledge in specific domains such as medicine. However, the robustness of LLMs to a variety of question formats in MCQs has not been thoroughly evaluated. While there are studies on the sensitivity of LLMs to input variations, research on their responsiveness to different question formats remains limited. In this study, we therefore propose a method for constructing tasks that comprehensively evaluate robustness to changes in MCQ format by decomposing the answering process into several steps. Using this dataset, we evaluate six LLMs, including Llama3-70B and Mixtral-8x7B. The results reveal a lack of robustness to differences in MCQ format. When assessing LLMs with MCQ datasets, it is crucial to consider whether the question format influences evaluation scores.
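To illustrate the kind of format variation the abstract refers to, the sketch below renders a single MCQ item in a few alternative prompt formats (lettered options, numbered options, and an answer-with-the-option-text variant). The function name, the chosen formats, and the example item are illustrative assumptions, not the paper's actual task-construction pipeline.

```python
# Illustrative sketch (not the paper's pipeline): render one MCQ item
# in several prompt formats to probe an LLM's format robustness.

def render_mcq(question, choices, style="letters"):
    """Return a prompt string for the same item in a given format."""
    if style == "letters":    # A. / B. / C. ...
        labels = [chr(ord("A") + i) for i in range(len(choices))]
        instruction = "Answer with the letter of the correct option."
    elif style == "numbers":  # 1. / 2. / 3. ...
        labels = [str(i + 1) for i in range(len(choices))]
        instruction = "Answer with the number of the correct option."
    elif style == "text":     # unlabeled options; answer with the option text
        labels = ["-"] * len(choices)
        instruction = "Answer with the text of the correct option."
    else:
        raise ValueError(f"unknown style: {style}")

    lines = [question]
    for label, choice in zip(labels, choices):
        lines.append(f"- {choice}" if style == "text" else f"{label}. {choice}")
    lines.append(instruction)
    return "\n".join(lines)


if __name__ == "__main__":
    q = "Which organ produces insulin?"
    opts = ["Liver", "Pancreas", "Kidney", "Spleen"]
    for s in ("letters", "numbers", "text"):
        print(render_mcq(q, opts, style=s), end="\n\n")
```

A robustness test in this spirit would present the same underlying items under each format and compare the model's accuracy across formats; a model that is robust to format changes should score similarly on all variants.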
Paper Type: Short
Research Area: Language Modeling
Research Area Keywords: evaluation
Contribution Types: Data resources, Data analysis
Languages Studied: English
Submission Number: 5908