# Query processing
1. preprocess.py
2. extract_subset.py - Run after step 1 of Document processing (check_valid_entity_wiki.py)

# Document processing
1. check_valid_entity_wiki.py
2. construct_800k_kb.py
3. read_document.py
4. download_document_images.py
5. remove_nonexisting_images.py
6. construct_test_kb.py
7. construct_train_val_kb.py

We will ultimately convert the dataset and the KB into the Encyclopedic-VQA format.

## Dataset

Note that the InfoSeek test split does not provide the entity_id (wikidata_id).
Hence, previous papers split the original train set into train / val sets and use the original val set as the test set.

info_seek_${split}.jsonl

- data_id: 'infoseek_train_00000009'
- image_id: 'oven_00057589'
- question: 'What is this place named after?'
- answer: ['Spain']
- answer_eval: ['Kingdom of Spain', 'ESP', '🇪🇸', 'ES', 'Spain']
- data_split: 'train'

info_seek_${split}_withkb.jsonl

- data_id: 'infoseek_train_00380640'
- entity_id: 'Q200339' (connected to the wikidata_id of the KB)
- entity_text: 'Mandarin duck' (entity name)
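Since the two files share `data_id`, joining them is a simple dictionary lookup. A minimal sketch (`merge_with_kb` is a hypothetical helper name; field names are taken from the schemas above):

```python
import json


def merge_with_kb(base_lines, kb_lines):
    """Join info_seek_${split}.jsonl records with their
    info_seek_${split}_withkb.jsonl counterparts on data_id."""
    # Index the withkb records by data_id for O(1) lookup.
    kb = {rec["data_id"]: rec for rec in (json.loads(l) for l in kb_lines)}
    merged = []
    for line in base_lines:
        rec = json.loads(line)
        extra = kb.get(rec["data_id"], {})
        # entity_id / entity_text are missing when the entity is unknown.
        rec["entity_id"] = extra.get("entity_id")
        rec["entity_text"] = extra.get("entity_text")
        merged.append(rec)
    return merged
```

In practice the two jsonl files would be read line by line from disk and passed in as iterables of strings.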


We need to add an 'evidence_id' field to the dataset, which will be generated with GPT-4.
The final data format should be a CSV file, loadable as a pandas DataFrame, where each row contains the fields below:

- question: The question Q to be used for the VQA triplets.
- answer: The answer to the question. This field may contain multiple answers:
  if the question was answered by multiple annotators, the answers are separated by '|'.
  For multi_answer questions, the individual answers are separated by '&&'.
- dataset_image_ids: A list of up to 5 identifiers for the images associated with the question.
  The IDs correspond to images from the image dataset.
- wikipedia_url: The URL of the Wikipedia article that corresponds to the knowledge base for the question.
  This URL acts as a key to our provided knowledge base. For two_hop questions this field contains the two
  consecutive URLs separated by '|'.
- evidence_section_id: An integer identifier indicating the section of the knowledge base where the evidence
  can be found. For two_hop questions there are two IDs separated by '|'.
- encyclopedic_vqa_split: This defines the split in our Encyclopedic-VQA dataset: train, val, or test.
- question_original: The original text of the question before any rephrasing.
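The '|' and '&&' separators above nest, so parsing an `answer` cell takes two split passes. A minimal sketch (`parse_answers` is a hypothetical helper name):

```python
def parse_answers(answer_field: str):
    """Split an Encyclopedic-VQA 'answer' cell into a list of answer lists.

    '|' separates alternatives from different annotators; within each
    alternative, '&&' separates the individual answers of a
    multi_answer question.
    """
    return [alternative.split("&&") for alternative in answer_field.split("|")]
```

The same '|' split applies to `wikipedia_url` and `evidence_section_id` for two_hop questions.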


The final form of InfoSeek dataset is as below:

- question: The question Q to be used for the VQA triplets.
- answer: The answer to the question. This field may contain multiple answers:
  if the question was answered by multiple annotators, the answers are separated by '|'.
  For multi_answer questions, the individual answers are separated by '&&'.
- answer_eval: Other accepted forms of the answer, used for evaluation.
- dataset_image_ids: A list of up to 5 identifiers for the images associated with the question.
  The IDs correspond to images from the image dataset.
- wikipedia_url: The URL of the Wikipedia article that corresponds to the knowledge base for the question.
  This URL acts as a key to our provided knowledge base.
- evidence_section_id: An integer identifier indicating the section of the knowledge base where the evidence
  can be found. For two_hop questions there are two IDs separated by '|'.
- infoseek_split: This defines the split of the InfoSeek dataset: train, val_unseen_question, or val_unseen_entity.
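Flattening the merged jsonl records into this CSV schema is mostly field renaming plus joining list fields with '|'. A minimal sketch, assuming `evidence_section_id` is left empty here and filled in later by the GPT-4 step (`to_final_rows` and the `wikipedia_url` lookup are hypothetical; only the column names come from the schema above):

```python
import pandas as pd


def to_final_rows(records):
    """Flatten merged InfoSeek records into the final CSV schema.

    evidence_section_id is left empty; it is populated later by the
    GPT-4 evidence-localisation step.
    """
    rows = []
    for rec in records:
        rows.append({
            "question": rec["question"],
            "answer": "|".join(rec["answer"]),
            "answer_eval": "|".join(rec.get("answer_eval", rec["answer"])),
            "dataset_image_ids": rec["image_id"],
            # Assumed to be attached earlier from the KB via entity_id.
            "wikipedia_url": rec.get("wikipedia_url", ""),
            "evidence_section_id": "",
            "infoseek_split": rec["data_split"],
        })
    return pd.DataFrame(rows)
```

The resulting DataFrame can then be written with `df.to_csv(path, index=False)` and reloaded with `pd.read_csv`.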


--------------------------------------------------------------------------------------------------------------



## Wikipedia Knowledge Base

For the KB, the InfoSeek KB (Wiki6M_ver_1_0_title_only.jsonl; Wiki6M_ver_1_0.jsonl is not needed) contains:

- wikidata_id: 'Q2819676' (connected to the entity_id of the dataset)
- wikipedia_title: 'List of A Certain Magical Index chapters'

We have to convert the above documents into the Encyclopedic-VQA format: a dictionary where the
key is the URL and the value is a dictionary as below:

- title: Title of the Wikipedia article
- section_titles: List with titles of each section. Its first element is identical to title.
- section_texts: List with contents for each section.
- image_urls: List with urls to images within the Wikipedia article.
- image_reference_descriptions: List with reference descriptions (i.e. captions) of the images.
- image_section_indices: List of integers denoting the sections where each image belongs to (i.e. index in
  section_titles and section_texts).
- url: The wikipedia_url (again)
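A minimal sketch of this conversion from the title-only file. Note the assumptions: `build_eva_kb` is a hypothetical helper name, the title-only records are assumed to carry only `wikidata_id` and `wikipedia_title`, the URL is derived from the title when absent, and the section/image fields start empty (they would be filled by the later read_document.py / download_document_images.py steps):

```python
import json


def build_eva_kb(wiki_jsonl_lines):
    """Convert Wiki6M title-only records into the Encyclopedic-VQA KB
    layout: a dict keyed by wikipedia_url."""
    kb = {}
    for line in wiki_jsonl_lines:
        rec = json.loads(line)
        # Derive the URL from the title when the record lacks one
        # (an assumption about the title-only file's contents).
        url = rec.get("wikipedia_url") or (
            "https://en.wikipedia.org/wiki/"
            + rec["wikipedia_title"].replace(" ", "_")
        )
        kb[url] = {
            "title": rec["wikipedia_title"],
            # The first section title is identical to the article title.
            "section_titles": [rec["wikipedia_title"]],
            "section_texts": [""],
            "image_urls": [],
            "image_reference_descriptions": [],
            "image_section_indices": [],
            "url": url,
        }
    return kb
```

Later pipeline steps would then append to `section_titles`/`section_texts` and the image lists in place.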