{
    "ex_safety": "This dataset is designed to evaluate large language models for exaggerated safety behaviors. These behaviors occur when models refuse to comply with safe prompts because they contain language similar to that of unsafe prompts or mention sensitive topics. This dataset aims to test whether models can identify genuinely unsafe prompts while not overly restricting safe prompts.",
    "GSM8K": "It is a dataset of high quality linguistically diverse grade school math word problems created by human problem writers. These problems take between 2 and 8 steps to solve, and solutions primarily involve performing a sequence of elementary calculations using basic arithmetic operations (+ − ×÷) to reach the final answer. A bright middle school student should be able to solve every problem. It can be used for multi-step mathematical reasoning. Each problem should only have one question and one correct answer. ",
    "TRAM-storytelling": "Each entry in the dataset includes a user's query and a list of tool options. The model is required to select the most appropriate tool from the list that can best address the query. The dataset is designed to test the model's ability to choose the right tool.",
    "TRAM-causality": "Each entry in the dataset includes a user's query and a list of tool options. The model is required to select the most appropriate tool from the list that can best address the query. The dataset is designed to test the model's ability to choose the right tool.",
    "BoolQ": "This dataset is a question-and-answer dataset on reading comprehension. Given the title of a passage and the content of it, it requires providing a \"true\" or \"false\" answer to the given question. These questions are unexpectedly challenging as they often query for complex, non-factoid information and require difficult entailment-like inference to solve.",
    "HellaSwag": "This dataset consists of multiple-choice questions designed to test the logical reasoning and contextual understanding of AI models. Each question sets up a scenario and asks \"What happens next?\" with four potential answers. Only one answer is logically sound and contextually appropriate, while the other three are implausible, either contradicting the scenario's details or representing unlikely outcomes.\nThe purpose of these questions is to challenge AI models to use logical sequencing, inferential reasoning, and practical insights effectively. This dataset aims to refine AI abilities in predicting logical continuations in scenarios that mimic real-life logic and events, ensuring the challenges are complex and thought-provoking.",
    "MMLU-cons": "It is a large-scale, multi-task language understanding dataset designed to evaluate language models' capabilities across various language understanding tasks. The dataset questions are presented in a multiple-choice format, each with a question (referred to as \"text\") followed by options. Each question is associated with a correct answer (\"label\")",
    "causal-judgement": "The dataset contains various scenarios designed to test causal judgment. Each entry includes a scenario described in detail, followed by a question about the causality involved, and multiple-choice options for answers. The target indicates the expected answer to the question based on typical causal reasoning.",
    "metatool-thought": "The dataset contains many user queries, which often include problems that large language models cannot solve, such as obtaining real-time information and processing non-text modal information. These queries typically require large language models to use external tools to find solutions, thus testing whether the models are aware of when to use tools; that is, they should choose to use tools when necessary and not use them otherwise.",
    "BBH-bool": "The dataset consists of Boolean expressions and their respective evaluations. Each entry in the dataset is a pair, comprising a Boolean expression (as a question) and the expected result (as a label). The Boolean expressions include combinations of True, False, and, or, and not operators, testing various logical conditions. This dataset is useful for training models to understand and evaluate Boolean logic.",
    "TruthfulQA": "This dataset is designed to measure the truthfulness and accuracy of answers generated in response to common questions, some of which are often answered incorrectly by humans due to widespread misconceptions or false beliefs. The purpose of the dataset is to evaluate how well a model can distinguish factual accuracy from popular myths or erroneous understandings in various domains including history, science, and general knowledge.\nEach entry in the dataset consists of a question followed by multiple choice answers where only one is correct. The dataset challenges the model's ability to use historical data, scientific facts, and logical reasoning to select the correct answer over plausible but incorrect alternatives that might reflect common misunderstandings.",
    "multiNLI": "The dataset is a crowd-sourced collection of sentence pairs annotated with textual entailment information. Each data item contains two different sentences and has the label \"neutral\", \"contradiction\", or \"entailment\".",
    "CRASS-FTM": "The dataset appears to consist of scenarios and corresponding questions about causal relationships. Each entry in the dataset includes a premise, a question about the causation, and multiple possible answers, including the correct answer.",
    "ARC-challenge": "The dataset is designed to test the model’s ability to understand and correctly answer science questions at a grade-school level, focusing on assessing capabilities such as comprehension, reasoning, and application of scientific knowledge. Each entry in the dataset consists of a question followed by multiple choice answers where only one is correct.",
    "emobench": "This dataset is designed to facilitate the understanding of complex emotional scenarios. Each entry in the dataset presents a unique scenario accompanied by a question and multiple emotional response options. The goal is to determine the most appropriate emotional response for the individual involved in the scenario. Only ONE option is tagged as the correct emotional response.",
    "MMLU": "It is a large-scale, multi-task language understanding dataset designed to evaluate language models' capabilities across various language understanding tasks. The dataset questions are presented in a multiple-choice format, each with a question (referred to as \"text\") followed by four options (labeled A, B, C, and D). Each question is associated with a correct answer (\"label\")"
}