Benchmarks as Microscopes: A Call for Model Metrology

Published: 10 Jul 2024 · Last Modified: 26 Aug 2024 · COLM · CC BY 4.0
Research Area: Evaluation
Keywords: position paper, benchmarks, capabilities, measurement
TL;DR: A position paper arguing that current LLM benchmarks are inadequate both for understanding model capabilities and for assessing their use in deployment, and proposing a subfield (model metrology) focused on studying how to make good benchmarks.
Abstract: Modern language models (LMs) pose a new challenge in capability assessment. Static benchmarks inevitably saturate without providing confidence in the deployment tolerances of LM-based systems, but developers nonetheless claim that their models have generalized traits such as reasoning or open-domain language understanding based on these flawed metrics. The science and practice of LMs requires a new approach to benchmarking which measures specific capabilities with dynamic assessments. To be confident in our metrics, we need a new discipline of *model metrology*---one which focuses on how to generate benchmarks that predict performance under deployment. Motivated by our evaluation criteria, we outline how building a community of model metrology practitioners---one focused on building tools and studying how to measure system capabilities---is the best way to meet these needs and add clarity to the AI discussion.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the COLM Code of Ethics on https://colmweb.org/CoE.html
Author Guide: I certify that this submission complies with the submission instructions as described on https://colmweb.org/AuthorGuide.html
Submission Number: 1185