Beyond Static Benchmarks: A Validity, Reliability, and Sociotechnical Framework for Evaluating LLMs in Deployment Contexts

Published: 29 Apr 2026 · Last Modified: 29 Apr 2026 · Eval Eval @ ACL 2026 Poster · License: CC BY 4.0
Keywords: LLM evaluation, benchmarks, deployment, validity, reliability, sociotechnical alignment, evaluation protocols, simulation, staged evaluation
TL;DR: VRS-Eval ties deployment validity, perturbation stability, and stakeholder rubric alignment to reportable quantities, then stress-tests benchmark-only evaluation against a staged pipeline in a controlled simulator.
Abstract: Static leaderboards summarize large language model (LLM) performance but offer weak evidence under shifting usage, noisy inputs, and plural stakeholder values. We present VRS-Eval, which operationalizes deployment validity (alignment between benchmark and deployment scores), operational reliability (stability under a declared perturbation family), and sociotechnical alignment (agreement between metric weights and elicited rubric weights, reported as a thin audit summary). Using a reproducible simulator with an explicit P_B vs. P_D distribution shift and multi-turn interaction, we stress-test evaluation protocols in a controlled environment: under our main setting, benchmark-side scores (on P_B) exceed estimated deployment-side utility scores (evaluated on trajectories from P_D) by roughly 22–26% in relative terms across three metrics, with tight 95% percentile intervals (K=200). Failure mixtures emphasize overfitting, shift fragility, and rubric misalignment, consistent with asymmetric first- vs. third-party evaluation reporting. A staged pipeline narrows the validity gap and raises reliability under the same generative story; we discuss implications for evaluation harnesses and accountability.
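To make the reported quantities concrete, the following is a minimal Python sketch, not taken from the paper, of how a relative validity gap between benchmark-side and deployment-side scores could be estimated together with a 95% percentile interval from K=200 bootstrap resamples; the function names, score arrays, and toy data are all hypothetical.

    # Illustrative sketch (hypothetical data and names, not the paper's code):
    # estimate the relative gap between benchmark scores (on P_B) and
    # deployment-side utility scores (on trajectories from P_D), with a
    # 95% percentile interval over K=200 bootstrap resamples.
    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical per-item scores for the same model under the two regimes.
    benchmark_scores = rng.normal(loc=0.80, scale=0.05, size=500)   # scored on P_B
    deployment_scores = rng.normal(loc=0.64, scale=0.08, size=500)  # scored on P_D trajectories

    def relative_gap(bench, deploy):
        # Relative amount by which benchmark scores exceed deployment utility.
        return (bench.mean() - deploy.mean()) / deploy.mean()

    K = 200  # number of bootstrap resamples
    gaps = []
    for _ in range(K):
        b = rng.choice(benchmark_scores, size=benchmark_scores.size, replace=True)
        d = rng.choice(deployment_scores, size=deployment_scores.size, replace=True)
        gaps.append(relative_gap(b, d))

    lo, hi = np.percentile(gaps, [2.5, 97.5])
    print(f"point estimate: {relative_gap(benchmark_scores, deployment_scores):.3f}")
    print(f"95% percentile interval (K={K}): [{lo:.3f}, {hi:.3f}]")

Under this reading, a gap of about 0.22–0.26 corresponds to the 22–26% relative figures quoted in the abstract; the paper's actual estimator and resampling scheme may differ.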
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Type: Research Paper
Archival Status: Archival
Submission Number: 64