# Generate Customized CLEVR Datasets
This code is a fork of
https://github.com/facebookresearch/clevr-dataset-gen (CLEVR dataset generation) and
https://github.com/facebookresearch/clevr-iep (feature generation and base models).

### Features generation
Once you have your images, you can extract features using a pretrained ResNet-101.
To do so, run `python3 main.py --run gen_features`.
Use `--path_dataset` to specify the dataset path, which must contain the image folders named `*_*_images`, such as `train_A_images`.
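
As a quick sanity check before running feature extraction, the sketch below lists the image folders found under a dataset path. It only assumes the `*_*_images` naming described above. The helper name `find_image_folders` is hypothetical and not part of this repository.

```python
from pathlib import Path


def find_image_folders(path_dataset):
    """Return the sorted names of sub-folders matching `*_*_images`."""
    root = Path(path_dataset)
    # Keep directories only; a stray file with a matching name is ignored.
    return sorted(p.name for p in root.glob("*_*_images") if p.is_dir())


if __name__ == "__main__":
    import tempfile

    # Build a throwaway directory layout to demonstrate the check.
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("train_A_images", "val_A_images", "notes"):
            (Path(tmp) / name).mkdir()
        print(find_image_folders(tmp))
```

An empty result suggests that `--path_dataset` points at the wrong directory.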

### Templates generation
You can generate the templates by running `python3 main.py --run gen_templates`.
We provide the 2Hop templates in `CLEVR_dataset_generation/question_generation/two_hop_templates`. The constraints for these templates are added in their respective JSON files, and the constraints themselves are defined in `CLEVR_dataset_generation/question_generation/generate_questions.py` after line 270.

### Questions generation
To generate questions, see the commands in `launch_twohop_question_generation.sh` for an example.

A small fraction (less than 1%) of the questions in the 3/2Hop OOD sets might overlap with 2Hop A. We used `verify_two_hop.py` to generate a mask that excludes these questions from the OOD test sets when computing accuracies.
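
The effect of such a mask on the accuracy can be sketched as follows. This is a minimal illustration, not the code in `verify_two_hop.py`: the mask format is an assumption (one boolean per question, `True` meaning the question is kept), and `masked_accuracy` is a hypothetical helper.

```python
def masked_accuracy(predictions, answers, keep_mask):
    """Accuracy over the questions whose mask entry is True."""
    if not (len(predictions) == len(answers) == len(keep_mask)):
        raise ValueError("predictions, answers and keep_mask must have equal length")
    # Drop the questions flagged as overlapping (mask entry False).
    kept = [(p, a) for p, a, keep in zip(predictions, answers, keep_mask) if keep]
    if not kept:
        raise ValueError("the mask excludes every question")
    correct = sum(1 for p, a in kept if p == a)
    return correct / len(kept)


if __name__ == "__main__":
    predictions = ["yes", "no", "2", "red"]
    answers = ["yes", "yes", "2", "blue"]
    keep_mask = [True, False, True, True]  # second question is excluded
    print(masked_accuracy(predictions, answers, keep_mask))
```

Excluded questions are removed from both the numerator and the denominator, so they do not count as errors.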

### D3 sets generation
To generate a sample D3 set, see the commands in `augment_questions.sh`.

### Question complexity distribution generation
To generate datasets for question complexity distributions, follow the two steps below.
First, merge the generated JSON question files. The following command randomly samples 400k questions from each of the 3 question files and writes a new question file with a total of 1.2M questions:

```
python merge_question_files.py \
  --question_json_files \
    data/two_hop_datasets/questions/2HopA/CLEVR_trainA_questions.json \
    data/two_hop_datasets/questions/0HopA/CLEVR_trainA_questions.json \
    data/two_hop_datasets/questions/1HopFull/CLEVR_trainA_questions.json \
  --output_json_file data/two_hop_datasets/questions/LARGE_wide/CLEVR_trainA_questions.json \
  --num_samples 400000
```
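
Conceptually, the merge step samples a fixed number of questions from each file and concatenates them. The sketch below illustrates that idea only and is not the implementation of `merge_question_files.py`. It assumes each file stores its question list under a top-level `"questions"` key, as in the standard CLEVR question files, and the `seed` parameter is a hypothetical addition for reproducibility.

```python
import json
import random


def merge_question_files(paths, num_samples, seed=0):
    """Sample `num_samples` questions from each file and concatenate them."""
    rng = random.Random(seed)
    merged = []
    for path in paths:
        with open(path) as f:
            questions = json.load(f)["questions"]  # assumed top-level key
        # random.sample draws without replacement and raises ValueError
        # if a file holds fewer than `num_samples` questions.
        merged.extend(rng.sample(questions, num_samples))
    return {"questions": merged}
```

With three input files and `num_samples=400000`, the merged list holds 3 × 400,000 = 1.2M questions.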

Then, generate the datasets with different distributions and a total of 800k questions using a catalog file:

```
python gen_dataset_dists.py \
  --wide_json_file data/two_hop_datasets/questions/LARGE_wide/CLEVR_trainA_questions.json \
  --catalog sample_dataset_catalog.json \
  --output data/two_hop_datasets/wide_subsets/800k \
  --total 800000
```
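
To illustrate how a fixed total can be divided across question types, the sketch below turns a set of proportions into integer counts that sum exactly to the total. The catalog schema is not documented here, so the proportion dictionary and the helper `allocate_counts` are hypothetical; the actual format of `sample_dataset_catalog.json` may differ.

```python
def allocate_counts(proportions, total):
    """Split `total` into integer counts proportional to `proportions`."""
    weight_sum = sum(proportions.values())
    if weight_sum <= 0:
        raise ValueError("proportions must sum to a positive value")
    exact = {k: total * w / weight_sum for k, w in proportions.items()}
    counts = {k: int(v) for k, v in exact.items()}
    # Hand out the units lost to rounding down, largest remainder first.
    leftover = total - sum(counts.values())
    by_remainder = sorted(exact, key=lambda k: exact[k] - counts[k], reverse=True)
    for k in by_remainder[:leftover]:
        counts[k] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical distribution skewed towards 2Hop questions.
    print(allocate_counts({"0Hop": 1, "1Hop": 1, "2Hop": 2}, 800000))
```

Largest-remainder rounding keeps the counts summing to the requested total even when the proportions do not divide it evenly.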

The generated questions can then be fed to CLOSURE for training and testing.
