# Dissecting Zero-Shot Visual Reasoning Capabilities in Vision and Language Models

## Folder structure

The main directories and their content are organized as follows:

- `/clevr`: This directory contains the following subdirectories:
    - `/blip2-flant5`: This directory contains the following files:
        - `blip2_instructed_generation_clevr_final.ipynb` - Standard prompting for `blip2-flan-t5` model family
        - `blip2_instructed_generation_clevr_final_cot.ipynb`
    - `/blip2-flant5-metadata`
        - `blip2_instructed_generation_clevr_final.ipynb` - VLM Standard Prompting with Metadata information
    - `/flant5`: This directory contains the following files:
        - `t5_instructed_generation_final_cot.ipynb` - CoT prompting for `flan-t5` model family
        - `t5_instructed_generation_final.ipynb` - Standard prompting for `flan-t5` model family
    - `/gpt`: This directory contains the following files:
        - `Instruct GPT-3 CLEVR.ipynb` - GPT experiments for CLEVR dataset.
      

- `/ptr`: This directory contains the following subdirectories:
    - `/blip2-flant5`: This directory contains the following files:
        - `blip2_instructed_generation_ptr_final_cot.ipynb`
        - `blip2_instructed_generation_ptr_final.ipynb`
    - `/blip2-flant5-metadata`
        - `blip2_instructed_generation_ptr_final.ipynb` - VLM Standard Prompting with Metadata information
    - `/flant5`: This directory contains the following files:
        - `t5_instructed_generation_final_cot.ipynb`
        - `t5_instructed_generation_final.ipynb`

    - `/gpt`: This directory contains the following files:
        - `Instruct GPT-3 PTR.ipynb`
- `/gqa`: This directory contains one notebook which was used for running the `GQA` experiments. Relevant instructions are provided in the notebook. 

- `/image_free`: Code to run the image-free baseline on the CLEVR and PTR datasets.

- `/eval`: This directory contains the evaluation code, in the following files:
    - `blip2-flant5 eval.ipynb` 
    - `flan-t5 output eval.ipynb` 
    - `gpt eval analysis.ipynb` 

## Dependencies

The two major libraries required for these experiments are:
- [The Hugging Face Transformers](https://huggingface.co/docs/transformers/index) library for LLM experiments.
- [The Salesforce-LAVIS library](https://github.com/salesforce/LAVIS.git) for VLM experiments. 

Follow the LAVIS [installation instructions](https://github.com/salesforce/LAVIS#installation) to install all dependencies for the project.


## Processing the datasets

The CLEVR and PTR datasets can be found here:
- [CLEVR](https://cs.stanford.edu/people/jcjohns/clevr/)
- [PTR](http://ptr.csail.mit.edu/)

Every Jupyter notebook provides a `load_data()` function that processes the data and stores it in the expected format. The final output should be a single JSON file for the whole dataset, containing the `metadata` as well as the `questions` for each scene in the dataset. Note that the processed datasets only need to be generated once, after which they can be reused for every experiment.
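As a rough illustration of the target layout, the sketch below groups questions under their scene and writes one JSON file. It is not the notebooks' `load_data()` implementation: the helper names `build_scene_mapping` and `save_scene_mapping`, and the assumption that scenes and questions share an `image_filename` key, are hypothetical and may need adjusting to the raw dataset files.

```python
import json
from collections import defaultdict


def build_scene_mapping(scenes, questions, key="image_filename"):
    """Group questions under the scene they refer to.

    `scenes` and `questions` are lists of dicts; `key` is the field
    assumed (hypothetically) to link a question to its scene.
    """
    questions_by_scene = defaultdict(list)
    for question in questions:
        questions_by_scene[question[key]].append(question)

    mapping = {}
    for scene in scenes:
        scene_id = scene[key]
        mapping[scene_id] = {
            "metadata": scene,  # full scene annotation
            "questions": questions_by_scene.get(scene_id, []),
        }
    return mapping


def save_scene_mapping(mapping, path):
    """Write the whole processed dataset to a single JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(mapping, f, indent=2)
```

Scenes without any question keep an empty `questions` list, so the file still covers every scene in the dataset.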

## Running the experiments

Each Jupyter notebook has a `run_on_dataset(scene_mapping: dict)` method, which takes the processed dataset as input, then stores and returns the final answers for each question. The model selections, as well as the dataset and prompting options, are provided in the notebooks.
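The general shape of such a loop can be sketched as follows. This is a simplified stand-in, not the notebooks' code: the name `run_on_dataset_sketch`, the `answer_fn` callback (which would wrap the actual model call), the `output_path` parameter, and the result field names are all assumptions made for illustration.

```python
import json


def run_on_dataset_sketch(scene_mapping, answer_fn, output_path=None):
    """Ask `answer_fn` every question in the processed dataset.

    `scene_mapping` maps a scene id to {"metadata": ..., "questions": [...]}.
    `answer_fn(metadata, question_text)` is a hypothetical callback that
    returns the model's answer as a string.
    """
    results = []
    for scene_id, entry in scene_mapping.items():
        for question in entry["questions"]:
            prediction = answer_fn(entry["metadata"], question["question"])
            results.append({
                "scene": scene_id,
                "question": question["question"],
                "answer": question.get("answer"),  # ground truth, if present
                "prediction": prediction,
            })

    # Optionally store the answers so they can be evaluated later.
    if output_path is not None:
        with open(output_path, "w", encoding="utf-8") as f:
            json.dump(results, f, indent=2)
    return results
```

Passing the model call in as a callback keeps the loop identical across models and prompting styles; only `answer_fn` changes.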

## Evaluating the results

The `eval` folder contains notebooks that process the results and produce visualizations of each model's performance. All the result calculation and processing code is already defined, so only the final output files from the experiments are required to obtain the results.
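For a quick sanity check outside the notebooks, a minimal exact-match accuracy can be computed as sketched below. This is an assumption about a reasonable metric, not the evaluation used in the `eval` notebooks, and the `prediction` and `answer` field names are hypothetical.

```python
def exact_match_accuracy(results):
    """Fraction of records whose prediction equals the ground-truth answer.

    Comparison ignores case and surrounding whitespace. Each record is
    assumed to be a dict with "prediction" and "answer" fields.
    """
    if not results:
        return 0.0
    correct = sum(
        str(r["prediction"]).strip().lower() == str(r["answer"]).strip().lower()
        for r in results
    )
    return correct / len(results)
```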