# Document processing
1. extract_old_wiki_urls.py
2. merge_and_clean.py
3. construct_200k_kb.py - run after Query processing 1. preprocess.py
4. read_document.py
5. download_document_images.py
6. remove_nonexisting_images.py
7. remove_empty_doc.py
8. construct_train_val_kb.py
9. construct_test_kb.py

Note: steps 8 and 9 (construct_train_val_kb.py and construct_test_kb.py) are omitted, since the KB is small compared to
those of Encyclopedic-VQA and InfoSeek.

# Query processing
1. preprocess.py - run after Document processing 1. extract_old_wiki_urls.py
2. evidence_name_to_id.py - run after Document processing 4. read_document.py

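Because the two lists depend on each other, the combined run order can be hard to read off. The sketch below derives one valid order with Python's standard `graphlib` module. It rests on assumptions that the lists do not state outright: steps within each list run in their numbered order, the first Query processing script is named `preprocess.py` (as in its own list entry), and the omitted steps 8 and 9 are left out. The `deps` mapping is illustrative and is not part of the repository.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each key maps to the scripts that must finish before it runs.
# Assumption: numbered steps within a list are sequential; the cross-list
# "run after" notes are added on top of that.
deps = {
    "extract_old_wiki_urls.py": set(),
    "merge_and_clean.py": {"extract_old_wiki_urls.py"},
    # Query processing 1 runs after Document processing 1.
    "preprocess.py": {"extract_old_wiki_urls.py"},
    # Document processing 3 runs after Query processing 1.
    "construct_200k_kb.py": {"merge_and_clean.py", "preprocess.py"},
    "read_document.py": {"construct_200k_kb.py"},
    "download_document_images.py": {"read_document.py"},
    "remove_nonexisting_images.py": {"download_document_images.py"},
    "remove_empty_doc.py": {"remove_nonexisting_images.py"},
    # Query processing 2 runs after Document processing 4.
    "evidence_name_to_id.py": {"preprocess.py", "read_document.py"},
}

# static_order() yields every script after all of its prerequisites.
order = list(TopologicalSorter(deps).static_order())

for i, script in enumerate(order, start=1):
    print(f"{i}. {script}")
```

Several orders satisfy these constraints, so the printed sequence is only one acceptable schedule. What matters is that `preprocess.py` runs before `construct_200k_kb.py`, and `read_document.py` runs before `evidence_name_to_id.py`.
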

We will ultimately convert the dataset and KB into Encyclopedic-VQA format.
ViQuAE has few queries, about 6k in total.
Since the KB for the dataset is large, we only process a subset of the KB, of size 100k.
