Between Rigor and Reality: How AI Safety Benchmark Developers Understand Benchmark Use and Maintenance

Published: 29 Apr 2026 · Last Modified: 29 Apr 2026 · Eval Eval @ ACL 2026 Poster · CC BY 4.0
Keywords: generative AI, safety evaluations, evaluation methods, benchmarks
TL;DR: We explore how AI safety benchmark developers perceive the gap between intended and actual usefulness of safety benchmarks, and the development challenges that contribute to it.
Abstract: Safety benchmarks are cited as the primary tool to understand the risks posed by generative AI (genAI) systems, yet growing evidence suggests they often fail to meet the needs of real-world safety evaluation. We present findings from five interviews with AI safety benchmark developers in academia and industry. We find a disconnect between perceptions of the intended usefulness of safety benchmarks and their practical usefulness in safety evaluations of deployed systems. Our participants offered perspectives on two challenges contributing to this disconnect: (1) the difficulty of achieving inter-rater reliability on safety constructs, and (2) a lack of clarity regarding how to address persistent threats to external validity. Based on our analysis, we argue that safety benchmarks must not only be grounded in deployment contexts, but actively integrated with them. To facilitate this, we outline pathways for more transparent communication between academic and industry AI safety stakeholders.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Type: Research Paper
Archival Status: Non-archival
Submission Number: 76