# Document processing
1. extract_old_wiki_urls.py
2. read_raw_html.py
3. download_document_images.py
4. remove_nonexisting_images.py
5. construct_train_val_kb.py
6. construct_test_kb.py
Note: we omit construct_train_val_kb.py and construct_test_kb.py here because the resulting KB
is small compared to the KBs from Encyclopedic-VQA and InfoSeek.
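The steps above can be sketched as a small driver that runs the scripts in order. This is an illustration, not part of the repo: the script names come from the list above, but invoking them via `subprocess` (and the `dry_run` flag) is our assumption.

```python
import subprocess
import sys

# Document-processing scripts, in the order listed above.
# The two construct_*_kb.py steps are omitted (small KB).
DOCUMENT_PIPELINE = [
    "extract_old_wiki_urls.py",
    "read_raw_html.py",
    "download_document_images.py",
    "remove_nonexisting_images.py",
]

def run_pipeline(scripts, dry_run=False):
    """Run each script in order, stopping at the first failure."""
    for script in scripts:
        if dry_run:
            print(f"would run: {script}")
            continue
        subprocess.run([sys.executable, script], check=True)

if __name__ == "__main__":
    run_pipeline(DOCUMENT_PIPELINE, dry_run=True)
```

With `dry_run=True` the driver only prints the planned order, which is handy for checking step dependencies before a long run.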


# Query processing
1. extract_wikitablequestions.py - run after step 2 of Document processing (read_raw_html.py)

We will ultimately convert the dataset and KB into the Encyclopedic-VQA format.
The Open-WikiTable dataset contains the WikiTableQuestions dataset, which provides old Wikipedia URLs.
Hence, we use only the WikiTableQuestions part, leveraging the open-QA format of the Open-WikiTable dataset.

The original query dataset ({split}.json) contains the following:

- question_id: the unique ID for each question in train/valid/test
- original_table_id: the original table ID from WikiSQL and WikiTableQuestions. Tables are split
  row-wise into 100-word chunks and then re-indexed; the mapping is in splitted_tables.json
- question: the decontextualized and paraphrased version of the question
- sql: the corresponding SQL query for the question
- answer: the answer for each question in a format of python list
- hard_positive_idx: the index of the split table chunk that satisfies every condition the question asks for
- positive_idx: the index of the split table chunk that satisfies at least one, but not every, condition the
  question asks for. For example, when the question asks for two conditions (e.g. NFL Team = "New England Patriots" and Position = "Running back"), the hard_positive table contains both entities, whereas the positive table contains only one of them
- negative_idx: the index of the split table chunk that is similar to the grounding table according to BM25
- dataset: the origin of the dataset
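As an illustration, a record with these fields might look like the Python dict below. Every value is invented, and whether the *_idx fields hold single indices or lists is our assumption, not something the description above pins down.

```python
import json

# One illustrative {split}.json record; every value below is invented.
record = {
    "question_id": 0,                    # unique per train/valid/test split
    "original_table_id": "wtq-table-0",  # source table in WikiSQL / WTQ
    "question": "Which NFL team picked the running back?",
    "sql": "SELECT team FROM t WHERE position = 'Running back'",
    "answer": ["New England Patriots"],  # answers come as a Python-style list
    "hard_positive_idx": 12,             # chunk satisfying every condition
    "positive_idx": 13,                  # chunk satisfying some, not all, conditions
    "negative_idx": 99,                  # BM25-similar distractor chunk
    "dataset": "wikitablequestions",     # origin of the example
}

# The record round-trips through JSON like the released files.
assert json.loads(json.dumps(record)) == record
```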

The metadata in page/xxx-page/yyy.json includes the URL, the page title, and the index of the chosen table. We use the URL to generate the interleaved document.
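A hedged sketch of loading one such metadata file: the key names ("url", "title", "table_idx") are assumptions, since the description above names the contents but not the exact keys.

```python
import json

def load_page_metadata(path):
    """Read one page metadata JSON file and return (url, title, table index).

    The key names below are assumptions; adjust them to the released files.
    """
    with open(path, encoding="utf-8") as f:
        meta = json.load(f)
    return meta["url"], meta["title"], meta["table_idx"]
```

The returned URL is what later serves as the wikipedia_url key into the knowledge base.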

The final form of the Open-WikiTable dataset is as follows:

- question: The question Q to be used for the VQA triplets.
- answer: The answer to the question. This field may contain multiple answers:
  if the question was answered by multiple annotators, the answers are separated by '|'.
  For multi_answer questions, individual answers are separated by '&&'.
- wikipedia_url: The URL of the Wikipedia article that corresponds to the knowledge base for the question.
  This URL acts as a key to our provided knowledge base.
- evidence_section_id: An integer identifier indicating the section of the knowledge base where the evidence
  can be found. For two_hop questions there are two IDs separated by '|'.
- open_wikitable_split: This defines the split in our Encyclopedic-VQA dataset: train, val_unseen_question,
  or val_unseen_entity.
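The '|' and '&&' separators in the answer field can be unpacked with a small helper; the function name is ours, but the separator semantics are exactly those described above.

```python
def parse_answers(answer_field):
    """Split an answer string into a list of annotator answers,
    each itself a list of individual answers (for multi_answer questions)."""
    return [a.split("&&") for a in answer_field.split("|")]

print(parse_answers("Paris"))                 # [['Paris']]
print(parse_answers("red&&blue|red&&green"))  # [['red', 'blue'], ['red', 'green']]
```

The same '|' convention applies to evidence_section_id for two_hop questions, so an analogous split works there too.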

