Keywords: language model evaluation, benchmarks, naturalistic evaluation, functional evaluation, verifiability, instruction following
TL;DR: We propose a small verifiability-first standard for naturalistic functional evaluation of language models, illustrated with a schema-constrained instruction-following prototype and multilingual, resource-aware measurement and governance protocols.
Abstract: Static leaderboards and single-turn judgments correlate weakly with deployment outcomes, especially in multilingual and resource-constrained settings. This position paper argues that credible evaluation hinges on verifiability: ex ante specifications that permit observable checks, repeatable scoring, and auditable evidence. We propose a minimal standard that makes verifiability first-class while remaining compatible with existing workflows. The standard comprises four artifacts: a task schema, a validator entry point, a run card, and required reporting fields. We ground the proposal in prior work on coverage and transparency and on specification-based checks. We present a prototype evaluation task for schema-constrained instruction following with robustness probes and a multilingual protocol, and we attach measurement and governance procedures that link scores to validity arguments. The goal is to replace generic win rates with verifiable claims about task success that better predict real use across languages and contexts.
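For concreteness, below is a minimal sketch of how the four artifacts could fit together. All names, fields, and the `validate` signature are illustrative assumptions for exposition, not the interface defined in the paper.

```python
# Illustrative sketch of the four artifacts (task schema, validator entry point,
# run card, required reporting fields). Names and fields are assumptions.
import json
from dataclasses import dataclass, asdict

# 1. Task schema: the ex ante specification of what a valid output must contain.
TASK_SCHEMA = {
    "task_id": "schema_constrained_instruction_following_v0",
    "languages": ["en", "sw", "hi"],                      # multilingual coverage
    "output_fields": {"title": "str", "steps": "list[str]"},
    "robustness_probes": ["paraphrase", "distractor_clause"],
}

# 2. Validator entry point: an observable, repeatable check against the schema
#    that returns auditable evidence rather than a bare score.
def validate(output: dict, schema: dict = TASK_SCHEMA) -> dict:
    missing = [field for field in schema["output_fields"] if field not in output]
    return {
        "task_id": schema["task_id"],
        "passed": not missing,
        "evidence": {"missing_fields": missing},
    }

# 3. Run card with 4. required reporting fields: the provenance needed to audit
#    how a reported score was produced.
@dataclass
class RunCard:
    model_id: str
    task_id: str
    language: str
    validator_version: str
    pass_rate: float

if __name__ == "__main__":
    verdict = validate({"title": "Backup procedure",
                        "steps": ["mount drive", "copy files"]})
    card = RunCard(
        model_id="example-model",
        task_id=TASK_SCHEMA["task_id"],
        language="en",
        validator_version="0.1",
        pass_rate=1.0 if verdict["passed"] else 0.0,
    )
    print(json.dumps({"verdict": verdict, "run_card": asdict(card)}, indent=2))
```

Under these assumptions, a reported result would always ship with the schema, the validator verdicts, and the run card, so a reader can re-run the check and trace the score back to observable evidence.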
Submission Number: 25