## Intermediate Schema Creation

This README describes how to convert any dataset into the intermediate schema. We start by defining the schema:

```python
INTERMEDIATE_SCHEMA = {
    "task_type": "GenerativeMCQ",
    "dataset": "",
    "original_dataset_metadata": "https://huggingface.co/datasets/google/boolq/",
    "dataset_input": "",  # Instance given to the LLM without any instruction.
    "candidate_answer_set": [],  # List of all possible answers for that instance.
    "candidate_answer_label_space": [],  # List of all possible answer labels.
    "ground_truth_answer_label": "",
    "ground_truth_answer_text": "",
    "dataset_instruction": "",  # Task prompt. It should not define how to generate the answer.
    "final_suffix_task_instruction": "",  # Final task instruction, appended to the input and original_task_instruction.
    "final_prefix_task_instruction": "",  # Final task instruction, prepended to the input and original_task_instruction.
    "task_instructions": [],
    "instruction_output": [],
}
```
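
Because the template contains list-valued fields, reusing the same dictionary for every instance risks sharing those lists between records. The following is a minimal sketch of creating a fresh record per instance. The helper `new_instance` is hypothetical (it is not part of this project), and the template is trimmed to a few keys only so that the snippet is self-contained.

```python
import copy

# Trimmed stand-in for the INTERMEDIATE_SCHEMA template defined above.
INTERMEDIATE_SCHEMA = {
    "task_type": "GenerativeMCQ",
    "dataset": "",
    "candidate_answer_set": [],
    "candidate_answer_label_space": [],
}


def new_instance(**fields):
    """Return a fresh record based on the template (hypothetical helper)."""
    unknown = set(fields) - set(INTERMEDIATE_SCHEMA)
    if unknown:
        raise KeyError(f"fields not in schema: {sorted(unknown)}")
    # Deep copy so that list fields are not shared between records.
    record = copy.deepcopy(INTERMEDIATE_SCHEMA)
    record.update(fields)
    return record


first = new_instance(dataset="BoolQ")
first["candidate_answer_set"].append("True")
second = new_instance(dataset="BoolQ")  # unaffected by the change to `first`
```

A shallow copy (`dict(INTERMEDIATE_SCHEMA)`) would not be enough here, since both records would then point at the same list objects.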

Every instance must be converted to the above format. For example, a raw instance in the BoolQ dataset looks like this:

```json
{
    "passage": "Persian (/ˈpɜːrʒən, -ʃən/), also known by its endonym Farsi (فارسی fārsi (fɒːɾˈsiː) ( listen)), is one of the Western Iranian languages within the Indo-Iranian branch of the Indo-European language family. It is primarily spoken in Iran, Afghanistan (officially known as Dari since 1958), and Tajikistan (officially known as Tajiki since the Soviet era), and some other regions which historically were Persianate societies and considered part of Greater Iran. It is written in the Persian alphabet, a modified variant of the Arabic script, which itself evolved from the Aramaic alphabet.",
    "question": "do iran and afghanistan speak the same language",
    "answer": true
}
```

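After conversion, a BoolQ instance looks like this (the example below shows a different instance from the one above):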

```json
{
    "task_type": "MCQ",
    "dataset": "Boolq",
    "original_dataset_metadata": "https://huggingface.co/datasets/google/boolq/",
    "dataset_input": "Passage: All biomass goes through at least some of these steps: it needs to be grown, collected, dried, fermented, distilled, and burned. All of these steps require resources and an infrastructure. The total amount of energy input into the process compared to the energy released by burning the resulting ethanol fuel is known as the energy balance (or ``energy returned on energy invested''). Figures compiled in a 2007 report by National Geographic Magazine point to modest results for corn ethanol produced in the US: one unit of fossil-fuel energy is required to create 1.3 energy units from the resulting ethanol. The energy balance for sugarcane ethanol produced in Brazil is more favorable, with one unit of fossil-fuel energy required to create 8 from the ethanol. Energy balance estimates are not easily produced, thus numerous such reports have been generated that are contradictory. For instance, a separate survey reports that production of ethanol from sugarcane, which requires a tropical climate to grow productively, returns from 8 to 9 units of energy for each unit expended, as compared to corn, which only returns about 1.34 units of fuel energy for each unit of energy expended. A 2006 University of California Berkeley study, after analyzing six separate studies, concluded that producing ethanol from corn uses much less petroleum than producing gasoline.\nQuestion: does ethanol take more energy make that produces\nAnswer:",
    "candidate_answer_set": ["True", "False"],
    "candidate_answer_label_space": ["0", "1"],
    "ground_truth_answer_label": "1",
    "ground_truth_answer_text": "False",
    "instruction_input": "Given a passage and a boolean question, answer the question by generating True or False.\n",
    "instruction_output": "False"
}
```
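
The mapping from a raw BoolQ record to the converted form can be sketched as follows. This is an illustrative sketch, not the project's actual converter: the function name `boolq_to_intermediate` is hypothetical, the passage in the usage example is shortened, and the label is assumed to be the entry of `candidate_answer_label_space` at the same position as the answer in `candidate_answer_set`. If the label should instead repeat the answer text, change that one line.

```python
import json


def boolq_to_intermediate(example):
    """Map one raw BoolQ record to the converted format (illustrative sketch)."""
    candidates = ["True", "False"]
    labels = ["0", "1"]
    # BoolQ stores the answer as a boolean; the converted record uses text.
    answer_text = "True" if example["answer"] else "False"
    return {
        "task_type": "MCQ",
        "dataset": "Boolq",
        "original_dataset_metadata": "https://huggingface.co/datasets/google/boolq/",
        "dataset_input": (
            f"Passage: {example['passage']}\n"
            f"Question: {example['question']}\n"
            "Answer:"
        ),
        "candidate_answer_set": candidates,
        "candidate_answer_label_space": labels,
        # Assumption: the label is the positional label of the answer.
        "ground_truth_answer_label": labels[candidates.index(answer_text)],
        "ground_truth_answer_text": answer_text,
        "instruction_input": (
            "Given a passage and a boolean question, answer the question "
            "by generating True or False.\n"
        ),
        "instruction_output": answer_text,
    }


# Usage with a shortened version of the raw instance shown earlier.
raw = {
    "passage": "Persian is primarily spoken in Iran, Afghanistan, and Tajikistan.",
    "question": "do iran and afghanistan speak the same language",
    "answer": True,
}
converted = boolq_to_intermediate(raw)
print(json.dumps(converted, ensure_ascii=False, indent=4))
```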