# Steps to Generate MetaDataset

<!-- ## [Quick]  MetaDataset Preview
The quick way preview the dataset without generating it is to look at the samples that we have uploaded.  -->

<!-- ## Detail Instruction to Generate MetaDataset -->

## Download Visual Genome
We use the pre-processed and cleaned version of Visual Genome by [Hudson and Manning](https://arxiv.org/pdf/1902.09506.pdf). 

- Download image files (~20GB) from: 
https://nlp.stanford.edu/data/gqa/images.zip

```sh
wget -c https://nlp.stanford.edu/data/gqa/images.zip
unzip images.zip -d allImages
```

- [Optional] Download the annotations provided with the base dataset (scene graphs): 
https://nlp.stanford.edu/data/gqa/sceneGraphs.zip  

You don't need `sceneGraphs.zip` to run the code in this repo, but it provides detailed annotations for each image that might be useful for your project. 
```sh
wget -c https://nlp.stanford.edu/data/gqa/sceneGraphs.zip  
unzip sceneGraphs.zip -d sceneGraphs
```
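
If you do download the scene graphs, counting object names is a quick way to inspect them. The sketch below assumes the GQA scene-graph layout (a JSON object keyed by image ID, where each entry has an `objects` dict whose values carry a `name` field). `load_scene_graphs` and `count_object_names` are hypothetical helpers, not part of this repo.

```python
import json
from collections import Counter


def load_scene_graphs(path):
    """Load a scene-graph JSON file, e.g. sceneGraphs/val_sceneGraphs.json."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def count_object_names(scene_graphs):
    """Count how often each object name appears across all images.

    Assumed layout: {image_id: {"objects": {object_id: {"name": ...}}}}.
    """
    counts = Counter()
    for graph in scene_graphs.values():
        # Tolerate images without annotated objects.
        for obj in graph.get("objects", {}).values():
            counts[obj["name"]] += 1
    return counts


# Example usage (path assumed from the unzip command above):
# graphs = load_scene_graphs("sceneGraphs/val_sceneGraphs.json")
# print(count_object_names(graphs).most_common(10))
```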



After this step, the base dataset file structure should look like this:
```
/data/GQA/
    allImages/
        images/
            <ID>.jpg
    sceneGraphs/
        train_sceneGraphs.json
        val_sceneGraphs.json
```
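
Before moving on, it can help to confirm that the extracted files match the layout above. The following is a minimal sketch; `check_gqa_layout` is a hypothetical helper (not part of this repo), and the root `/data/GQA` is only the example location shown above.

```python
from pathlib import Path


def check_gqa_layout(root, require_scene_graphs=False):
    """Return the expected paths that are missing under `root`."""
    root = Path(root)
    images = root / "allImages" / "images"
    missing = []
    if not images.is_dir():
        missing.append(str(images))
    elif not any(images.glob("*.jpg")):
        # The folder exists but contains no image files.
        missing.append(str(images / "<ID>.jpg"))
    if require_scene_graphs:
        # Scene graphs are optional, so they are only checked on request.
        for name in ("train_sceneGraphs.json", "val_sceneGraphs.json"):
            path = root / "sceneGraphs" / name
            if not path.is_file():
                missing.append(str(path))
    return missing


# Example usage:
# print(check_gqa_layout("/data/GQA", require_scene_graphs=True))
```

An empty list means everything expected was found.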



## Specify local path of Visual Genome
After extracting the files, set the image folder path 
(e.g., `IMAGE_DATA_FOLDER=/data/GQA/allImages/images/`) in [Constants.py](Constants.py). 
Also set the destination folder for the generated dataset (e.g., `PYTORCH_DATASET_FOLDER=/data/MetaDataset`).

## Generate MetaDataset
```sh
python generate_full_metadataset.py
```

Before running the script, set the following values in [Constants.py](Constants.py). 

The base dataset folder: `IMAGE_DATA_FOLDER=/data/GQA/allImages/images/`

The destination folder: `PYTORCH_DATASET_FOLDER=/data/MetaDataset`

By default, `ONLY_SELECTED_CLASSES = True`, so MetaDataset is generated only for the selected classes. 
Set it to `False` to generate the whole MetaDataset; however, the result would be very large. 

If `ONLY_SELECTED_CLASSES` is `True`, MetaDataset is generated only for the following classes: 
`SELECTED_CLASSES = ['cat', 'dog', 'bus', 'truck', 'elephant', 'horse', 'bowl', 'cup']` 

If `ONLY_SELECTED_CLASSES` is `False`, `SELECTED_CLASSES` is ignored. 
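
Put together, the relevant part of `Constants.py` might look like the sketch below. The values are the examples used above. The `should_generate` helper is hypothetical: it only illustrates how the flag and the class list interact as described here, and is not necessarily how the repo implements the filter.

```python
# Sketch of the settings described above; adjust the paths to your machine.
IMAGE_DATA_FOLDER = "/data/GQA/allImages/images/"
PYTORCH_DATASET_FOLDER = "/data/MetaDataset"

# Restrict generation to SELECTED_CLASSES (keeps the output small).
ONLY_SELECTED_CLASSES = True
SELECTED_CLASSES = [
    'cat', 'dog',
    'bus', 'truck',
    'elephant', 'horse',
    'bowl', 'cup',
]


def should_generate(class_name):
    """Hypothetical helper: True if `class_name` would be generated."""
    # When the flag is off, SELECTED_CLASSES is ignored and every class passes.
    return (not ONLY_SELECTED_CLASSES) or class_name in SELECTED_CLASSES
```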
