
tasks:
  gsm8k:
    task_description: "Solve grade-school math word problems that require multi-step reasoning and the application of basic arithmetic operations (addition, subtraction, multiplication, and division). May include intermediate calculations to arrive at the final numeric answer."
    task_objective: "Final answer in format <answer>a number</answer>."
    sample_question: |
      Jenny likes to play board games with her friends. She likes to play against her friend Mark the most, as she's played him 10 times and Mark has only won once. She doesn't like playing Jill, as she's played her twice as many times as she's played Mark and Jill has won 75% of them. How many times in total has Jenny won board games with her two friends?

  mtbench:
    task_description: "Handle multi-turn conversations across diverse tasks, including dialogue, instruction-following, roleplay, reasoning, coding, math, and domain-specific knowledge (STEM and humanities). Maintain coherence, context-awareness, and adherence to instructions over multiple turns."
    task_objective: ""
    sample_question: ""
    sample_questions: |
      "Compose an engaging travel blog post about a recent trip to Hawaii, highlighting cultural experiences and must-see attractions."
      "Rewrite your previous response. Start every sentence with the letter A."

  boolean_expressions:
    task_description: "Evaluate the result of a random Boolean expression."
    task_objective: "Final answer in format <answer>true or false</answer>."
    sample_question: |
      True and False and not True and True is

  causal_judgement:
    task_description: "Answer questions about causal attribution."
    task_objective: "Final answer in format <answer>Yes or No</answer>."
    sample_question: |
      How would a typical person answer each of the following questions about causation?
      Frank T., had an ongoing dispute with his neighbor over a stretch of land and one day decided to shoot his neighbor in the body. Frank T. had no experience with guns, his hand slipped on the barrel of the gun, and the shot went wild. Nonetheless, the bullet bounced off a large boulder several feet away and hit the neighbor's body, causing significant injury. Did Frank T. intentionally shoot his neighbor in the body?
      Options:
      - Yes
      - No
    
  date_understanding:
    task_description: "Infer the date from context."
    task_objective: "Final answer in format <answer>a multiple-choice option</answer>."
    sample_question: |
      Today is Christmas Eve of 1937. What is the date 10 days ago in MM/DD/YYYY?
      Options:
      (A) 12/14/2026
      (B) 12/14/1950
      (C) 12/14/2007
      (D) 12/14/1937
      (E) 07/14/1938
      (F) 12/14/1988

  disambiguation_qa:
    task_description: "Clarify the meaning of sentences with ambiguous pronouns."
    task_objective: "Final answer in format <answer>a multiple-choice option</answer>."
    sample_question: |
      In the following sentences, explain the antecedent of the pronoun (which thing the pronoun refers to), or state that it is ambiguous.
      Sentence: The chief told the counselor that they took the day off.
      Options:
      (A) The chief took the day off
      (B) The counselor took the day off
      (C) Ambiguous

  dyck_languages:
    task_description: "Correctly close a Dyck-n word."
    task_objective: "Final answer in format <answer>your answer</answer>."
    sample_question: |
      Complete the rest of the sequence, making sure that the parentheses are closed properly. Input: < > ( ( [ [ ( { } ) [ < > ] ]

  formal_fallacies:
    task_description: "Distinguish deductively valid arguments from formal fallacies."
    task_objective: "Final answer in format <answer>valid or invalid</answer>."
    sample_question: |
      "It is not always easy to see who is related to whom -- and in which ways. The following argument pertains to this question: To begin with, Lesley is a close friend of Fernando. Moreover, being a close friend of Fernando or a schoolmate of Lowell is sufficient for being a great-grandfather of Leroy. It follows that Lesley is a great-grandfather of Leroy."
      Is the argument, given the explicitly stated premises, deductively valid or invalid?
      Options:
      - valid
      - invalid

  v-geometric_shapes:
    task_description: "Name geometric shapes from their SVG paths."
    task_objective: "Final answer in format <answer>a multiple-choice option</answer>."
    sample_question: |
      This SVG path element <path d="M 31.00,73.00 L 32.00,59.00 L 44.00,50.00 L 49.00,41.00 L 64.00,37.00 L 71.00,55.00 L 64.00,76.00 L 52.00,61.00 L 31.00,73.00"/> draws a
      Options:
      (A) circle
      (B) heptagon
      (C) hexagon
      (D) kite
      (E) line
      (F) octagon
      (G) pentagon
      (H) rectangle
      (I) sector
      (J) triangle

  hyperbaton:
    task_description: "Order adjectives correctly in English sentences."
    task_objective: "Final answer in format <answer>A or B</answer>."
    sample_question: |
      Which sentence has the correct adjective order:
      Options:
      (A) rubber terrible ship
      (B) terrible rubber ship

  logical_deduction_five_objects:
    task_description: "Use logical deduction to determine the order of a sequence of objects."
    task_objective: "Final answer in format <answer>a multiple-choice option</answer>."
    sample_question: |
      The following paragraphs each describe a set of three objects arranged in a fixed order. The statements are logically consistent within each paragraph. In a golf tournament, there were three golfers: Amy, Eli, and Eve. Eve finished above Amy. Eli finished below Amy.
      Options:
      (A) Amy finished last
      (B) Eli finished last
      (C) Eve finished last

  logical_deduction_seven_objects:
    task_description: "Use logical deduction to determine the order of a sequence of objects."
    task_objective: "Final answer in format <answer>a multiple-choice option</answer>."
    sample_question: |
      The following paragraphs each describe a set of three objects arranged in a fixed order. The statements are logically consistent within each paragraph. In a golf tournament, there were three golfers: Amy, Eli, and Eve. Eve finished above Amy. Eli finished below Amy.
      Options:
      (A) Amy finished last
      (B) Eli finished last
      (C) Eve finished last

  logical_deduction_three_objects:
    task_description: "Use logical deduction to determine the order of a sequence of objects."
    task_objective: "Final answer in format <answer>a multiple-choice option</answer>."
    sample_question: |
      The following paragraphs each describe a set of three objects arranged in a fixed order. The statements are logically consistent within each paragraph. In a golf tournament, there were three golfers: Amy, Eli, and Eve. Eve finished above Amy. Eli finished below Amy.
      Options:
      (A) Amy finished last
      (B) Eli finished last
      (C) Eve finished last

  movie_recommendation:
    task_description: "Recommend movies similar to the given list of movies."
    task_objective: "Final answer in format <answer>a multiple-choice option</answer>."
    sample_question: |
      Find a movie similar to Star Wars Episode IV - A New Hope, Indiana Jones and the Last Crusade, Star Wars Episode V - The Empire Strikes Back, The Big Lebowski:
      Options:
      (A) Tetsuo, the Ironman
      (B) The Princess Bride
      (C) The Barkley Marathons The Race That Eats Its Young
      (D) Bug

  multistep_arithmetic_two:
    task_description: "Solve multi-step arithmetic problems."
    task_objective: "Final answer in format <answer>a number</answer>."
    sample_question: |
      ((-5 + 9 * -4 - 0) * (4 + -7 + 0 * -5)) =

  navigate:
    task_description: "Given a series of navigation instructions, determine whether one would end up back at the starting point."
    task_objective: "Final answer in format <answer>Yes or No</answer>."
    sample_question: |
      If you follow these instructions, do you return to the starting point? Turn left. Turn around. Turn left. Take 7 steps. Take 2 steps. Take 4 steps. Take 8 steps.
      Options:
      - Yes
      - No

  object_counting:
    task_description: "Questions that involve enumerating objects and asking the model to count them."
    task_objective: "Final answer in format <answer>a number</answer>."
    sample_question: |
      I have an orange, a raspberry, two peaches, a blackberry, an apple, a grape, a nectarine, and three plums. How many fruits do I have?

  penguins_in_a_table:
    task_description: "Answer questions about a table of penguins and their attributes."
    task_objective: "Final answer in format <answer>a multiple-choice option</answer>."
    sample_question: |
      Here is a table where the first line is a header and each subsequent line is a penguin:
      name, age, height (cm), weight (kg)
      Louis, 7, 50, 11
      Bernard, 5, 80, 13
      Vincent, 9, 60, 11
      Gwen, 8, 70, 15
      For example: the age of Louis is 7, the weight of Gwen is 15 kg, the height of Bernard is 80 cm.
      We now add a penguin to the table:
      James, 12, 90, 12
      How many penguins are less than 8 years old?
      Options:
      (A) 1
      (B) 2
      (C) 3
      (D) 4
      (E) 5

  reasoning_about_colored_objects:
    task_description: "Answer extremely simple questions about the colors of objects on a surface."
    task_objective: "Final answer in format <answer>a multiple-choice option</answer>."
    sample_question: |
      On the nightstand, there is a red pencil, a purple mug, a burgundy keychain, a fuchsia teddy bear, a black plate, and a blue stress ball. What color is the stress ball?
      Options:
      (A) red
      (B) orange
      (C) yellow
      (D) green
      (E) blue
      (F) brown
      (G) magenta
      (H) fuchsia
      (I) mauve
      (J) teal
      (K) turquoise
      (L) burgundy
      (M) silver
      (N) gold
      (O) black
      (P) grey
      (Q) purple
      (R) pink

  ruin_names:
    task_description: "Given an artist, band, or movie name, identify a one-character edit to the name that changes the meaning of the input and makes it humorous, while the result still consists of actual words."
    task_objective: "Final answer in format <answer>a multiple-choice option</answer>."
    sample_question: |
      Which of the following is a humorous edit of this artist or movie name: 'whitesnake'?
      Options:
      (A) whitesnape
      (B) whitesnapke
      (C) whitesnuake
      (D) mwhitesnake

  salient_translation_error_detection:
    task_description: "Detect the type of error in an English translation of a German source sentence."
    task_objective: "Final answer in format <answer>a multiple-choice option</answer>."
    sample_question: |
      The following translations from German to English contain a particular error. That error will be one of the following types: Named Entities: An entity (names, places, locations, etc.) is changed to a different entity. Numerical Values: Numerical values (ordinals or cardinals), dates, and/or units are changed. Modifiers or Adjectives: The modifiers and adjectives pertaining to a noun are changed. Negation or Antonyms: Introduce or remove a negation or change comparatives to their antonyms. Facts: Trivial factual errors not pertaining to the above classes are introduced in the translations. Dropped Content: A significant clause in the translation is removed. Please identify that error.
      Source: In der Liste der Baudenkmale in Lenzen (Elbe) sind alle Baudenkmale der brandenburgischen Stadt Lenzen (Elbe) und ihrer Ortsteile aufgelistet.
      Translation: In the list of architectural monuments in Lenzen all architectural monuments of the Brandenburg city of Lenzen and its districts are listed.
      The translation contains an error pertaining to
      Options:
      (A) Modifiers or Adjectives
      (B) Numerical Values
      (C) Negation or Antonyms
      (D) Named Entities
      (E) Dropped Content
      (F) Facts

  snarks:
    task_description: "Determine which of two sentences is sarcastic. According to the Cambridge University Dictionary, sarcasm is 'the use of remarks that clearly mean the opposite of what they say, made in order to hurt someone's feelings or to criticize something in a humorous way.' Sarcastic sentences often contain satirical or ironic utterances, hyperboles, ambivalent or witty remarks."
    task_objective: "Final answer in format <answer>a multiple-choice option</answer>."
    sample_question: |
      Which statement is sarcastic?
      Options:
      (A) Yes, because having interests and actively researching them is a huge waste
      (B) Yes, because having interests and actively researching them is a huge deal

  sports_understanding:
    task_description: "Determine whether an artificially constructed sentence relating to sports is plausible or not."
    task_objective: "Final answer in format <answer>yes or no</answer>."
    sample_question: |
      Is the following sentence plausible? "Bam Adebayo scored a reverse layup in the Western Conference Finals."

  temporal_sequences:
    task_description: "Answer questions about which times certain events could have occurred."
    task_objective: "Final answer in format <answer>a multiple-choice option</answer>."
    sample_question: |
      Today, Emily went to the museum. Between what times could they have gone?
      We know that:
      Emily woke up at 1pm.
      Elizabeth saw Emily reading at the library from 2pm to 4pm.
      Jessica saw Emily watching a movie at the theater from 4pm to 5pm.
      Leslie saw Emily waiting at the airport from 5pm to 6pm.
      William saw Emily buying clothes at the mall from 6pm to 7pm.
      The museum was closed after 7pm.
      Between what times could Emily have gone to the museum?
      Options:
      (A) 1pm to 2pm
      (B) 6pm to 7pm
      (C) 5pm to 6pm
      (D) 2pm to 4pm

  tracking_shuffled_objects_five_objects:
    task_description: "Determine the final positions of a set of objects given their initial positions and a description of a sequence of swaps."
    task_objective: "Final answer in format <answer>a multiple-choice option</answer>."
    sample_question: |
      Alice, Bob, and Claire are playing a game. At the start of the game, they are each holding a ball: Alice has a yellow ball, Bob has a blue ball, and Claire has a pink ball.
      As the game progresses, pairs of players trade balls. First, Claire and Alice swap balls. Then, Alice and Bob swap balls. Finally, Claire and Bob swap balls. At the end of the game, Bob has the
      Options:
      (A) yellow ball
      (B) blue ball
      (C) pink ball

  tracking_shuffled_objects_seven_objects:
    task_description: "Determine the final positions of a set of objects given their initial positions and a description of a sequence of swaps."
    task_objective: "Final answer in format <answer>a multiple-choice option</answer>."
    sample_question: |
      Alice, Bob, and Claire are playing a game. At the start of the game, they are each holding a ball: Alice has a yellow ball, Bob has a blue ball, and Claire has a pink ball.
      As the game progresses, pairs of players trade balls. First, Claire and Alice swap balls. Then, Alice and Bob swap balls. Finally, Claire and Bob swap balls. At the end of the game, Bob has the
      Options:
      (A) yellow ball
      (B) blue ball
      (C) pink ball

  tracking_shuffled_objects_three_objects:
    task_description: "Determine the final positions of a set of objects given their initial positions and a description of a sequence of swaps."
    task_objective: "Final answer in format <answer>a multiple-choice option</answer>."
    sample_question: |
      Alice, Bob, and Claire are playing a game. At the start of the game, they are each holding a ball: Alice has a yellow ball, Bob has a blue ball, and Claire has a pink ball.
      As the game progresses, pairs of players trade balls. First, Claire and Alice swap balls. Then, Alice and Bob swap balls. Finally, Claire and Bob swap balls. At the end of the game, Bob has the
      Options:
      (A) yellow ball
      (B) blue ball
      (C) pink ball

  web_of_lies:
    task_description: "Evaluate a random Boolean function expressed as a word problem."
    task_objective: "Final answer in format <answer>yes or no</answer>."
    sample_question: |
      Question: Fidel tells the truth. Jerry says Fidel tells the truth. Vina says Jerry tells the truth. Millicent says Vina lies. Raymond says Millicent lies. Does Raymond tell the truth?

  word_sorting:
    task_description: "Sort a list of words."
    task_objective: "Final answer in format <answer>sorted word list</answer>."
    sample_question: |
      Sort the following words alphabetically: List: hypochlorite ponderosa phone credulity