Eval Task,Task Category,Task Type,#shots,#datapoints,Random baseline,Centered Metric?,Description
mmlu_zeroshot,world knowledge,multiple choice,0,14042,25,,"MMLU consists of 14,042 four-choice multiple choice questions distributed across 57 categories. The questions are in the style of academic standardized tests and the model is provided the question and the choices and is expected to choose between A, B, C, and D as its outputs. The subjects range from jurisprudence, to math, to morality."
hellaswag_zeroshot,language understanding,multiple choice,0,10042,25,,"HellaSwag consists of 10,042 multiple choice scenarios in which the model is prompted with a scenario and must choose the most likely conclusion to the scenario from four possible options."
jeopardy,world knowledge,language modeling,10,2117,0,,"Jeopardy consists of 2,117 Jeopardy questions separated into 5 categories: Literature, American History, World History, Word Origins, and Science. The model is expected to give the exact correct response to the question."
triviaqa_sm_sub,world knowledge,question answering,3,3000,0,,"Trivia QA is a question answering dataset that assesses the model's ability to produce free-response short answers to trivia questions. We've subsampled it to contain 3,000 questions and we've clipped all answers to be at most 10 tokens long in order to improve speed."
gsm8k_cot,symbolic problem solving,question answering,3,1319,0,,"GSM8K consists of 1,319 short, free-response grade school-level arithmetic word problems with simple numerical solutions. The model is prompted to use chain-of-thought reasoning before giving a final answer."
agi_eval_sat_math_cot,symbolic problem solving,question answering,3,220,0,,"AGI Eval SAT Math consists of 220 short, free-response SAT math questions in which the model is prompted to use chain-of-thought reasoning to produce properly formatted LaTeX solutions to the problems."
aqua_cot,symbolic problem solving,question answering,3,245,0,,"AQUA-RAT (Algebra Question Answering with Rationales) consists of 245 simple, short, free-response arithmetic word problems in which the model is prompted to use chain-of-thought reasoning to produce correct numerical and algebraic solutions. Unlike the original dataset, which is framed as a multiple choice test, we prompt the model to produce the exact correct answer without seeing any available choices."
svamp_cot,symbolic problem solving,question answering,3,300,0,,"SVAMP consists of 300 short, free-response grade school-level arithmetic word problems with simple numerical solutions. The model is prompted to use chain-of-thought reasoning before giving a final answer."
bigbench_qa_wikidata,world knowledge,language modeling,10,20321,0,,"BIG-bench wikidata consists of 20,321 questions regarding factual information pulled from Wikipedia. Questions range from the native language of celebrities to the country that different regions belong to. Models are given a sentence such as “The country of citizenship of Barack Obama is” and are expected to complete the sentence with, e.g., “the United States.”"
arc_easy,world knowledge,multiple choice,10,2376,25,,"ARC easy consists of 2,376 easy four-choice multiple choice science questions drawn from grade 3-9 science exams. The questions rely on world knowledge related to basic science."
arc_challenge,world knowledge,multiple choice,10,2376,25,,"ARC challenge consists of 2,376 four-choice multiple choice science questions drawn from grade 3-9 science exams. The questions rely on scientific world knowledge and some procedural reasoning."
mmlu_fewshot,world knowledge,multiple choice,5,14042,25,,"MMLU consists of 14,042 four-choice multiple choice questions distributed across 57 categories. The questions are in the style of academic standardized tests and the model is provided the question and the choices and is expected to choose between A, B, C, and D as its outputs. The subjects range from jurisprudence, to math, to morality."
bigbench_misconceptions,world knowledge,multiple choice,10,219,50,,"BIG-bench misconceptions consists of 219 true or false questions regarding common misconceptions about a variety of topics including urban legends, stereotypes, basic science, and law."
copa,commonsense reasoning,multiple choice,0,100,50,,"COPA consists of 100 cause/effect multiple choice questions in which the model is prompted with a premise and must choose correctly between two possible causes/effects of the premise."
siqa,commonsense reasoning,multiple choice,10,1954,33.3,,"Social Interaction QA consists of 1,954 three-choice multiple choice questions that test a model's ability to draw emotional and social conclusions about the participants in everyday situations."
commonsense_qa,commonsense reasoning,multiple choice,10,1221,20,,"Commonsense QA consists of 1,221 five-choice multiple choice questions that rely on very basic commonsense reasoning about everyday items."
piqa,commonsense reasoning,multiple choice,10,1838,50,,"PIQA consists of 1,838 two-choice multiple choice questions that test commonsense physical intuition."
openbook_qa,commonsense reasoning,multiple choice,0,500,25,,OpenBook QA consists of 500 four-choice multiple choice questions that rely on basic physical and scientific intuition about common objects and entities.
bigbench_novel_concepts,commonsense reasoning,multiple choice,10,32,20,,BIG-bench novel concepts consists of 32 find-the-common-concept problems in which the model is given 3 words and has to choose from among 4 possible concepts that they all have in common.
bigbench_strange_stories,commonsense reasoning,multiple choice,10,174,50,,"BIG-bench strange stories consists of 174 short stories followed by a two-choice multiple choice question in which the model is asked to make commonsense inferences about the characters in the stories, how they might feel, and why they act in certain ways."
bigbench_strategy_qa,commonsense reasoning,multiple choice,10,2289,50,,"BIG-bench strategy QA consists of 2,289 very eclectic yes/no questions on a wide range of commonsense subjects, e.g., “Can fish get tonsillitis?”"
lambada_openai,language understanding,language modeling,0,5153,0,,"LAMBADA consists of 5,153 passages taken from books. The model is expected to read the first N-1 words of each passage and predict the final token."
hellaswag,language understanding,multiple choice,10,10042,25,,"HellaSwag consists of 10,042 multiple choice scenarios in which the model is prompted with a scenario and must choose the most likely conclusion to the scenario from four possible options."
winograd,language understanding,schema,0,273,50,,"The Winograd Schema Challenge consists of 273 scenarios in which the model must use semantics to correctly resolve the anaphora in a sentence. Two possible beginnings to a sentence are presented as well as an ending. Both involve some anaphora being resolved in a different way, only one of which would be semantically valid, and the model must choose which option produces the valid resolution."
winogrande,language understanding,schema,0,1267,50,,"Winogrande consists of 1,267 scenarios in which two possible beginnings of a sentence are presented along with a single ending. Both combinations are syntactically valid, but only one is semantically valid, and the model must choose the one that is semantically valid."
bigbench_conlang_translation,language understanding,language modeling,0,164,0,,BIG-bench conlang translation consists of 164 example problems in which the model is given translations of simple sentences between English and some fake constructed language. The model is then tested for its ability to translate a complex sentence in the fake language into English.
bigbench_language_identification,language understanding,multiple choice,10,10000,9.1,,"BIG-bench language identification consists of 10,000 four-choice multiple choice questions in which a sentence in some language besides English is presented and the model is prompted to identify the language of the sentence amongst four options."
bigbench_conceptual_combinations,language understanding,multiple choice,10,103,25,,BIG-bench conceptual combinations consists of 103 four-choice multiple choice questions in which the model is presented with a made-up word and its definition along with a multiple choice question regarding the meaning of a sentence using that made-up word. The model is then expected to select the correct answer among the choices presented.
bigbench_elementary_math_qa,symbolic problem solving,multiple choice,10,38160,20,,"BIG-bench elementary math QA consists of 38,160 four-choice multiple choice arithmetic word problems."
bigbench_dyck_languages,symbolic problem solving,language modeling,10,1000,0,,"BIG-bench Dyck languages consists of 1,000 complete-the-sequence questions in which a partially completed balanced expression consisting of parentheses and braces is given, and the model needs to output the exact tokens necessary to complete the balanced expression."
agi_eval_lsat_ar,symbolic problem solving,multiple choice,3,230,20,,"AGI Eval LSAT Analytical Reasoning consists of 230 five-choice multiple choice logic puzzles. The questions are taken from the AGI Eval benchmark."
bigbench_cs_algorithms,symbolic problem solving,language modeling,10,1320,0,,"BIG-bench CS algorithms consists of 1,320 questions falling into one of two types. In the first type the model must determine the length of the longest common subsequence of two strings, and in the second type the model must determine whether an expression consisting of parentheses and braces is balanced."
bigbench_logical_deduction,symbolic problem solving,multiple choice,10,1500,20,,"BIG-bench logical deduction consists of 1,500 four-choice multiple choice questions in which the model is presented with a number of logical constraints describing the relative ordering of some number of objects. The model must then choose, from a list of four statements, the only one that is logically consistent with the constraints."
bigbench_operators,symbolic problem solving,language modeling,10,210,0,,"BIG-bench operators consists of 210 questions in which a number of mathematical operators are defined and the model is expected to calculate the result of some expression consisting of those defined operators. This tests the model’s ability to handle mathematical abstractions and apply them appropriately."
bigbench_repeat_copy_logic,symbolic problem solving,language modeling,10,32,0,,"BIG-bench repeat copy logic consists of 32 tasks in which the model is commanded to repeat some combination of words some number of times in a particular order, and the model is expected to output the correct result."
simple_arithmetic_nospaces,symbolic problem solving,language modeling,10,1000,0,,Simple arithmetic without spaces was developed by MosaicML. It consists of 1000 arithmetic problems consisting of up to 3 operations and using numbers of up to 3 digits. There is no spacing between any of the numbers and operators. The model is expected to calculate the correct result of the expression using the appropriate order of operations.
simple_arithmetic_withspaces,symbolic problem solving,language modeling,10,1000,0,,Simple arithmetic with spaces was developed by MosaicML. It consists of 1000 arithmetic problems consisting of up to 3 operations and using numbers of up to 3 digits. There is spacing between all numbers and operators. The model is expected to calculate the correct result of the expression using the appropriate order of operations.
math_qa,symbolic problem solving,multiple choice,10,2983,20,,"Math QA consists of 2,983 five-choice multiple choice math word problems. The questions require basic reasoning, language comprehension, and arithmetic/algebraic skills."
logi_qa,symbolic problem solving,multiple choice,10,651,27,,LogiQA consists of 651 four-choice multiple choice logical word problems. The questions involve making logical deductions based on mathematical and symbolic descriptions of problems.
pubmed_qa_labeled,reading comprehension,language modeling,10,1000,0,,"PubMed QA Labeled consists of 1,000 hand-labeled medical documents, each followed by a related question to which the model must respond yes/no/maybe."
squad,reading comprehension,language modeling,10,10570,0,,"SQuAD consists of 10,570 short documents followed by a related question. The documents range from short news clippings about sports events, to blurbs explaining concepts in physics, to documents about US history. We expect the model to output the exact correct answer."
agi_eval_lsat_rc,reading comprehension,multiple choice,3,268,20,,"LSAT Reading Comprehension consists of 268 passage-based five-choice multiple choice questions focused on a variety of information-focused domains like politics, business, economics, and science. The questions rely on the model's ability to extract basic info from the texts."
agi_eval_lsat_lr,reading comprehension,multiple choice,3,510,20,,LSAT Logical Reasoning consists of 510 passage-based five-choice multiple choice questions in which the model must draw complex conclusions from passages on a diverse range of subjects.
coqa,reading comprehension,language modeling,0,7983,0,,"CoQA consists of 7,983 passage-based short free response questions. For each passage there is a series of related questions. Each question is formatted with the document as well as all the preceding questions/answers provided in context. The model is evaluated using exact match accuracy."
bigbench_understanding_fables,reading comprehension,multiple choice,10,189,20,,BIG-bench understanding fables consists of 189 short stories followed by a 4-choice multiple choice question in which the model must select the correct moral for the story.
boolq,reading comprehension,multiple choice,10,3270,62,,"BoolQ consists of 3,270 short passages on a diverse range of subjects, each followed by a yes/no question. The model is expected to answer in multiple-choice format."
agi_eval_sat_en,reading comprehension,multiple choice,3,206,25,,SAT English consists of 206 passage-based four-choice multiple choice questions in which the model must comprehend high-school-level passages.
winogender_mc_female,safety,multiple choice,10,60,50,,"The Winogender Multiple Choice (Female) dataset is a custom formatting of the Winogender dataset introduced specifically for this repository. It consists of a 60-sentence subset of the original Winogender dataset in which a female pronoun's resolution is syntactically ambiguous but obvious from the semantics of the sentence. Each sentence is constructed such that the pronoun resolves to a profession that is atypical for a woman (e.g., engineer) according to BLS labor statistics. It is formatted as a two-choice multiple choice test in which the model must correctly identify which of two participants the pronoun refers to."
winogender_mc_male,safety,multiple choice,10,60,50,,"The Winogender Multiple Choice (Male) dataset is a custom formatting of the Winogender dataset introduced specifically for this repository. It consists of a 60-sentence subset of the original Winogender dataset in which a male pronoun's resolution is syntactically ambiguous but obvious from the semantics of the sentence. Each sentence is constructed such that the pronoun resolves to a profession that is atypical for a man (e.g., secretary) according to BLS labor statistics. It is formatted as a two-choice multiple choice test in which the model must correctly identify which of two participants the pronoun refers to."
enterprise_pii_classification,safety,multiple choice,10,3395,50,,"Enterprise PII Classification was released by Patronus AI through the MosaicML Eval Gauntlet. It is a two-choice classification task in which a model is presented with small passages and must determine whether the passage contains sensitive or personally identifiable information (PII). It consists of 3,395 samples."
bbq,safety,multiple choice,3,58492,50,,"Bias Benchmark for QA (BBQ) is a multiple choice dataset constructed to determine a model's level of stereotyping and bias against 11 different protected groups. Though it was originally designed as a three-choice multiple choice test, we reformatted it to use two choices. It consists of 58,492 samples split across 11 categories."
gpqa_main,world knowledge,multiple choice,5,448,25,,"GPQA is a challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics and chemistry."
gpqa_diamond,world knowledge,multiple choice,5,198,25,,GPQA Diamond is a subset of the highest quality questions from the GPQA dataset.
