Beyond Static Benchmarks: A Validity, Reliability, and Sociotechnical Framework for Evaluating LLMs in Deployment Contexts

Published: 29 Apr 2026, Last Modified: 10 May 2026 | Eval Eval @ ACL 2026 Poster | Readers: Everyone | CC BY 4.0
Keywords: LLM evaluation, benchmarks, deployment, validity, reliability, sociotechnical alignment, evaluation protocols, simulation, staged evaluation
TL;DR: VRS-Eval ties deployment validity, perturbation stability, and stakeholder rubric alignment to reportable quantities, then stress-tests benchmark-only evaluation against a staged pipeline in a controlled simulator.
Abstract: Static leaderboards summarize large language model (LLM) performance but offer weak evidence under shifting usage, noisy inputs, and plural stakeholder values. We present VRS-Eval, a framework that operationalizes three quantities: deployment validity (alignment between benchmark and deployment scores), operational reliability (stability under a declared perturbation family), and sociotechnical alignment (agreement between the metric and elicited rubric weights, reported as a thin audit summary). Using a reproducible simulator with an explicit shift from $P_B$ to $P_D$ and multi-turn interaction, we stress-test evaluation protocols in a controlled environment. Under our main setting, benchmark-side scores (on $P_B$) exceed estimated deployment-side utility scores (evaluated on trajectories from $P_D$) by roughly 21–26% in relative terms across three metrics, with tight 95% percentile intervals ($K=200$). Failure mixtures emphasize overfitting, shift fragility, and rubric misalignment, consistent with first- vs. third-party reporting asymmetries. A staged pipeline narrows the validity gap and raises reliability under the same generative story. Sensitivity sweeps over $|\Omega|$ and rubric-label rate preserve the rank ordering of harnesses, suggesting that the qualitative conclusions are robust to plausible design-choice variation within the simulator. We discuss implications for evaluation harnesses and accountability.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Type: Research Paper
Archival Status: Archival
Submission Number: 64
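
The abstract reports a relative gap between benchmark-side and deployment-side scores together with a 95% percentile interval at $K=200$. The sketch below shows one way such a quantity could be computed. It is an illustration under stated assumptions, not the authors' implementation: it assumes $K$ counts independent simulator replicates, that the gap is measured relative to the deployment-side score, and that the interval is the empirical 2.5th to 97.5th percentile range over replicates. The function names and the synthetic scores are hypothetical placeholders and do not reproduce any result from the paper.

```python
import random
import statistics


def relative_gap(benchmark_score, deployment_score):
    """Relative excess of the benchmark score over the deployment score.

    Assumption: the gap is normalized by the deployment-side score.
    """
    return (benchmark_score - deployment_score) / deployment_score


def percentile_interval_95(values):
    """Empirical 2.5th and 97.5th percentiles of the replicate values."""
    # n=40 yields cut points at 2.5%, 5%, ..., 97.5%.
    cuts = statistics.quantiles(values, n=40, method="inclusive")
    return cuts[0], cuts[-1]


# Synthetic placeholder scores, NOT values from the paper.
rng = random.Random(0)
K = 200  # assumed to be the number of simulator replicates
gaps = [
    relative_gap(rng.gauss(0.90, 0.01), rng.gauss(0.60, 0.01))
    for _ in range(K)
]

lo, hi = percentile_interval_95(gaps)
print(f"median relative gap: {statistics.median(gaps):.3f}")
print(f"95% percentile interval: [{lo:.3f}, {hi:.3f}]")
```

Normalizing by the benchmark-side score instead would give a smaller number for the same pair of scores, so a reported gap is only interpretable once the denominator is stated.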