Evaluating LLMs Adversarially with Word Guessing Game

ACL ARR 2024 June Submission 840 Authors

13 Jun 2024 (modified: 02 Jul 2024) · ACL ARR 2024 June Submission · CC BY 4.0
Abstract: Evaluating large language models (LLMs) is increasingly important. This paper presents Adversarial Guessing Evaluation (AGE), a new evaluation framework for LLMs. AGE employs a systematic set of rules and metrics to assess LLMs' reading comprehension and their ability to confuse an adversary across several dimensions. The framework significantly reduces the need for large datasets, requiring only a few word pairs. Our results align with the average outcomes of established comprehensive benchmarks and highlight areas where LLMs can improve.
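
To make the abstract's setup concrete, below is a minimal, hypothetical sketch of an adversarial word-guessing evaluation loop. It is not the authors' AGE implementation: the function names (query_model, play_round, evaluate), the prompts, and the scoring are illustrative assumptions; only the general idea of scoring a guesser against a hinter over a small set of word pairs comes from the abstract.

```python
# Minimal sketch of an adversarial word-guessing evaluation loop.
# NOTE: this is NOT the paper's AGE protocol; `query_model` is a
# hypothetical stand-in for whatever LLM interface is actually used.

from typing import Callable


def query_model(prompt: str) -> str:
    """Hypothetical LLM call; replace with a real API client."""
    raise NotImplementedError("Plug in an actual LLM backend here.")


def play_round(secret_word: str, decoy_word: str,
               hinter: Callable[[str], str],
               guesser: Callable[[str], str]) -> bool:
    """One round: the hinter describes `secret_word` while trying to
    steer the guesser toward `decoy_word`; returns True if the guesser
    still recovers the secret word."""
    hint = hinter(
        f"Describe the word '{secret_word}' without naming it, "
        f"while making it easy to confuse with '{decoy_word}'."
    )
    guess = guesser(
        f"Guess the single word described here: {hint}"
    )
    return secret_word.lower() in guess.lower()


def evaluate(word_pairs, hinter, guesser) -> float:
    """Fraction of rounds in which the guesser recovers the secret word,
    a rough comprehension-vs-confusion score."""
    wins = sum(play_round(s, d, hinter, guesser) for s, d in word_pairs)
    return wins / len(word_pairs)


if __name__ == "__main__":
    pairs = [("violin", "cello"), ("glacier", "iceberg")]
    # evaluate(pairs, query_model, query_model)  # requires a real LLM backend
    print(f"{len(pairs)} word pairs prepared for evaluation.")
```

Under this sketch, a high score indicates the guesser comprehends hints despite adversarial pressure, while a low score indicates the hinter confuses effectively; the actual metrics and rules are defined in the paper.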
Paper Type: Short
Research Area: Resources and Evaluation
Research Area Keywords: evaluation methodologies, metrics
Contribution Types: Model analysis & interpretability, Approaches to low-resource settings, Data resources, Data analysis
Languages Studied: English
Submission Number: 840