ECBD: Evidence-Centered Benchmark Design for NLP

Anonymous

16 Feb 2024 | ACL ARR 2024 February Blind Submission | Readers: Everyone
Abstract: Benchmarking is seen as critical to assessing progress in NLP. However, creating a benchmark involves many design decisions (e.g., which datasets to include, which metrics to use) that often rely on tacit, untested assumptions about what the benchmark is actually measuring. There is currently no principled way of analyzing these decisions and how they impact the validity of the benchmark’s measurements. To address this gap, we draw on evidence-centered design in educational assessments to propose ECBD (Evidence-Centered Benchmark Design). Our framework formalizes the benchmark design process into five modules and specifies the role of each module and their interplay in collecting the evidence necessary to support the benchmark’s measurement. We demonstrate the use of ECBD by conducting case studies with three benchmarks: BoolQ, SuperGLUE, and HELM. Our analysis reveals common trends in benchmark design and documentation that could threaten the validity of benchmarks’ measurements.
Paper Type: long
Research Area: Special Theme (conference specific)
Contribution Types: Data analysis, Position papers, Theory
Languages Studied: English
