Category,Criterion,BBQ,TruthfulQA,BOLD,MMLU,ARC Challenge,WinoGrande,Codex HumanEval,GSM8k*,HellaSwag,Machiavelli,AgentBench,MLCommons AI Safety v0.5,MMMU,GPQA,BIG-bench,DecodingTrust,Procgen,MedMNIST v2,Wordcraft,RL Unplugged,FinRL-Meta,SafeBench,PDEBench,ALE,MuJoCo,Averages,
Benchmark Design,Tested capability or concept is defined,15,15,15,10,15,10,15,15,15,15,15,15,15,15,5,15,15,15,0,15,10,15,15,15,15,13.4,
,How tested capability or concept translates to benchmark task is described,15,15,15,0,15,15,15,15,0,15,15,15,10,15,0,15,10,15,15,15,15,15,15,15,15,12.8,
,How knowing about the tested capability or concept is helpful in the real world is described,15,15,15,0,10,10,15,10,0,15,15,15,10,15,10,10,0,10,15,15,15,15,15,10,15,11.6,
,How benchmark scores should or shouldn't be interpreted or used is described,10,15,10,0,15,15,15,10,0,10,15,15,10,10,5,5,0,15,0,10,10,15,10,10,10,9.6,
,Domain experts are involved,10,15,10,0,5,10,15,10,0,10,10,15,15,15,10,15,15,15,0,15,15,15,15,10,10,11.0,
,Use cases or user personas are described,0,0,0,0,10,0,10,10,0,5,15,15,0,0,0,10,0,10,10,15,0,15,15,15,10,6.6,
,Domain literature is integrated,10,10,0,0,15,15,15,10,0,15,15,15,10,0,10,15,10,10,10,15,10,15,15,15,10,10.6,
,Informed performance metric choice,15,15,10,15,15,15,15,15,15,15,15,15,10,15,15,15,15,15,15,10,15,15,15,15,10,14.2,
,Metric floors and ceilings are included,15,15,0,15,15,15,15,15,15,15,10,15,10,15,10,15,10,15,15,10,10,10,10,10,10,12.4,
,Human performance level is included,15,15,n/a,10,15,15,0,15,15,10,5,n/a,15,15,15,0,0,0,15,0,0,0,n/a,15,0,8.6,
,Random performance level is included,15,n/a,0,15,n/a,0,n/a,0,10,15,n/a,n/a,15,15,15,n/a,n/a,0,n/a,0,0,n/a,n/a,15,0,7.7,
,Has validated automatic evaluation,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,10,14.8,
,Differences to related benchmarks are explained,15,15,15,15,0,10,15,15,10,15,15,5,15,15,10,15,15,15,10,15,10,15,15,10,0,12.2,
,Input sensitivity is addressed,15,10,0,0,0,n/a,0,0,0,n/a,10,15,0,0,10,15,15,0,n/a,15,n/a,15,15,15,10,7.6,
Benchmark Implementation,Evaluation code is available,10,15,0,15,15,15,15,10,15,15,15,15,15,15,15,15,15,15,15,15,10,15,10,15,15,13.6,
,Evaluation data or generation mechanism is accessible,10,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,10,10,15,15,15,15,14.4,
,Evaluation of models via API is supported,0,10,0,10,0,0,0,0,0,0,15,15,0,15,15,15,0,n/a,n/a,n/a,n/a,n/a,n/a,n/a,15,6.1,
,Evaluation of local models is supported,0,0,0,0,15,15,15,10,15,15,15,15,10,0,10,15,10,10,0,15,15,15,15,15,15,10.4,
,A globally unique identifier is added or evaluation instances are encrypted,0,0,0,0,0,0,0,0,0,0,0,0,0,15,15,0,n/a,0,n/a,n/a,n/a,n/a,n/a,0,0,1.6,
,Task to identify if model is trained on benchmark data is included,0,0,0,0,0,0,0,0,0,0,0,0,0,10,15,0,n/a,5,n/a,n/a,n/a,n/a,n/a,10,0,2.1,
,Script to replicate results is included,0,0,0,0,0,0,0,0,0,15,0,0,0,15,0,0,10,15,0,0,0,0,10,15,10,3.6,
,Statistical significance or uncertainty quantification of benchmark results is reported,0,0,0,0,0,15,10,15,0,0,0,0,0,0,10,0,15,0,10,15,0,0,15,15,15,5.4,
,Need for warnings for sensitive/harmful content is assessed,0,0,0,0,0,0,0,0,0,0,0,10,0,0,0,15,0,0,0,0,10,0,0,0,0,1.4,
,Build status (or equivalent) is implemented,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,
,Release requirements are specified,0,0,0,0,0,0,0,0,0,0,0,15,0,0,0,0,0,0,0,0,0,0,0,15,15,1.8,
Benchmark Documentation,Requirements file or equivalent is available,0,15,0,0,15,15,15,10,15,15,15,0,0,15,15,15,15,15,15,15,15,15,15,15,15,11.8,
,Quick-start guide or demo is available,0,15,10,0,10,15,15,0,0,15,15,15,10,15,10,15,15,15,10,15,15,15,15,15,15,11.6,
,In-line code comments are used,10,10,n/a,10,10,10,10,10,10,10,10,0,0,10,0,10,10,10,10,15,10,15,10,15,10,9.4,
,Code documentation is available,15,0,n/a,0,10,10,10,10,0,15,10,10,0,0,10,15,10,15,0,10,15,15,15,15,10,9.2,
,Accompanying paper is accepted at peer-reviewed venue,15,15,15,15,0,15,0,0,15,15,15,0,0,0,0,15,15,15,15,15,15,15,15,15,0,10.2,
,Benchmark construction process is documented,15,10,10,10,15,15,15,15,10,15,15,15,10,15,10,15,10,15,15,10,15,15,15,15,15,13.4,
,Test tasks & rationale are documented,15,10,15,10,15,15,15,15,10,15,15,15,15,15,10,15,15,10,15,15,15,15,15,15,15,14.0,
,Assumptions of normative properties are documented,15,0,10,0,15,10,15,10,0,10,5,15,0,n/a,0,10,0,n/a,0,n/a,n/a,n/a,n/a,n/a,n/a,6.8,
,Limitations are documented,15,15,0,0,15,10,15,10,0,15,10,15,10,15,0,15,0,0,0,0,15,10,10,15,0,8.4,
,"Data collection, test environment design, or prompt design process is documented",15,15,10,10,15,15,15,15,15,15,15,15,10,15,15,15,10,15,15,15,15,15,15,15,15,14.2,
,Evaluation metric is documented,15,10,10,15,15,15,15,15,15,15,15,15,10,15,10,15,15,15,10,10,15,15,15,15,15,13.8,
,Applicable license is specified,15,15,15,15,15,15,15,15,15,15,0,15,15,15,15,15,15,15,0,15,15,15,15,15,15,13.8,
Benchmark Maintenance,Code usability was checked within the last year,15,0,15,0,0,0,0,0,0,15,15,15,15,15,10,15,0,15,0,0,15,0,15,15,15,8.2,
,Maintained feedback channel for users is available,10,0,0,0,0,0,0,0,10,10,15,0,15,0,0,15,0,15,15,0,10,0,15,15,10,6.2,
,Contact person is listed,0,15,0,0,0,15,0,0,15,10,15,15,15,10,15,15,15,15,15,0,0,15,15,10,0,9.0,
,Total Score,9.3,9.1,6.2,5.5,8.7,9.7,9.5,8.3,6.8,11.2,10.6,10.9,7.9,10.3,8.6,11.5,8.9,10.4,8.4,9.9,10.1,11.3,12.8,12.9,9.6,9.5,
,Design Score,12.9,13.1,8.1,6.8,11.2,11.2,12.3,11.1,6.8,13.1,13.1,14.2,10.7,11.4,9.3,12.3,9.2,10.7,10.0,11.8,9.6,13.5,14.2,13.2,8.9,11.1,18
,Implementation Score,1.8,3.6,1.4,3.6,4.1,5.5,5.0,4.5,4.1,5.5,5.5,7.7,3.6,7.7,8.6,6.8,7.2,6.0,5.0,6.9,5.6,5.6,8.1,10.0,9.1,5.7,1
,Documentation Score,12.1,10.8,9.5,7.1,12.5,13.3,12.9,10.4,8.8,14.2,11.7,10.8,6.7,11.8,7.9,14.2,10.8,12.7,8.8,12.3,14.5,14.5,14.1,15.0,11.4,11.6,19
,Maintenance Score,8.3,5.0,5.0,0.0,0.0,5.0,0.0,0.0,8.3,11.7,15.0,10.0,15.0,8.3,8.3,15.0,5.0,15.0,10.0,0.0,8.3,5.0,15.0,13.3,8.3,7.8,9
,Usability Score,7.3,7.1,5.2,4.8,7.5,9.0,8.1,6.7,6.7,10.2,9.4,9.4,6.3,9.6,8.3,11.2,8.8,10.2,7.6,8.6,10.5,10.0,12.0,12.7,10.0,8.7,8
,Benchmark Rating,**,**,*,*,**,**,**,**,*,***,**,**,**,**,*,***,*,***,**,**,**,***,***,***,**,,
,# of non-NA design criteria,14,13,13,14,13,13,13,14,14,13,13,12,14,14,14,13,13,14,12,14,13,13,12,14,14,,
,# of non-NA implementation criteria,11,11,11,11,11,11,11,11,11,11,11,11,11,11,11,11,9,10,8,8,8,8,8,10,11,,
,# of non-NA documentation criteria,12,12,10,12,12,12,12,12,12,12,12,12,12,11,12,12,12,11,12,11,11,11,11,11,11,,
,# of non-NA maintenance criteria,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,,