# PixFoundation: Are We Heading in the Right Direction with Pixel-level Vision Foundation Models?

## Requirements
* Install Detectron2 (0.6)
* Install requirements
```
pip install -r requirements.txt
```
* The remaining instructions assume a conda environment named "pixmmvp" with the above requirements installed. If you use a different name, adjust the commands accordingly.

## Dataset setup
* Download the original [MMVP](https://www.google.com/search?client=safari&rls=en&q=MMVP+github&ie=UTF-8&oe=UTF-8$0) dataset for VQA.
* Follow this structure for the dataset folder
```
|--- MMVP
   |--- MMVP Images
   |--- Questions.csv
   |--- Objects.csv
   |--- Segmentations.json
   |--- visual_patterns.csv
```
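A quick layout check can catch a misplaced file before running the loaders. The following is a minimal sketch, not part of the repository: the helper name `missing_entries` is hypothetical, and it takes the path to the `MMVP` folder itself, since this README does not state whether DATA_DIR refers to that folder or to its parent.

```python
from pathlib import Path

# Entries copied from the folder structure listed above.
EXPECTED_ENTRIES = [
    "MMVP Images",
    "Questions.csv",
    "Objects.csv",
    "Segmentations.json",
    "visual_patterns.csv",
]


def missing_entries(mmvp_dir):
    """Return the expected entries that are absent from the MMVP folder."""
    root = Path(mmvp_dir)
    return [name for name in EXPECTED_ENTRIES if not (root / name).exists()]
```

Calling `missing_entries("path/to/MMVP")` returns an empty list when every listed file and folder is present.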

* To visualize the dataset, run the following (visualizations are available under Dataset/MMVPSeg/).

```
cd pixmmvp/eval/dataset_helpers/
python test_loader.py --root DATA_DIR --out_dir VIS_DIR
```

## Evaluation
* Clone the following repositories.
```
git clone https://github.com/lxtGH/OMG-Seg
git clone https://github.com/mbzuai-oryx/groundingLMM
git clone https://github.com/dvlab-research/LISA
git clone https://github.com/UX-Decoder/LLaVA-Grounding
git clone https://github.com/haotian-liu/LLaVA
git clone https://groundlmm.github.io/
git clone https://github.com/cambrian-mllm/cambrian
```

* Create a separate conda environment for each of the above repositories, following its requirements.

* Modify each bash script under pixmmvp/scripts/... to use the corresponding conda environment.

### Pixel-level MLLMs

* Select the PROMPT flag according to the probing you want to run.
```
PROMPT: 0 (First probing in the manuscript, protocol1)
PROMPT: 1 (Second probing in the manuscript, protocol3)
PROMPT: 3 (Third probing in the manuscript, protocol2)
```
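The flag values are not contiguous, and the probing order differs from the protocol numbering, so a small lookup table can help avoid mixing them up in wrapper scripts. This is an illustrative sketch only: `PROMPT_FLAGS` and `describe_prompt` are hypothetical names, and the mapping is simply copied from the list above.

```python
# Mapping copied from the list above: PROMPT flag -> (probing, protocol).
PROMPT_FLAGS = {
    0: ("first probing", "protocol1"),
    1: ("second probing", "protocol3"),
    3: ("third probing", "protocol2"),
}


def describe_prompt(flag):
    """Return a readable description of a PROMPT flag, rejecting unknown values."""
    if flag not in PROMPT_FLAGS:
        raise ValueError(
            f"Unknown PROMPT flag {flag}; expected one of {sorted(PROMPT_FLAGS)}"
        )
    probing, protocol = PROMPT_FLAGS[flag]
    return f"PROMPT={flag}: {probing} ({protocol})"
```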

* Run inference, using OMG-LLaVA as an example.
```
cd pixmmvp/scripts/
bash eval_omgllava.sh DATA_DIR OUT_DIR PROMPT ANS_DIR MODEL_DIR
```

* Run the VQA evaluation.
```
python ../eval/protocol1_accuracy.py --openai_api_key API_KEY --answer_file ANS_FILE_protocol1.jsonl
python ../eval/protocol2_accuracy.py --answers_file ANS_FILE_protocol2.jsonl
```
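For orientation, the original MMVP benchmark scores questions in pairs: a pair counts as correct only when both of its questions are answered correctly. The sketch below illustrates that convention under the assumption that the accuracy scripts above follow it, which this README does not state. The function name and the flat list of per-question results are hypothetical and do not reflect the format of the answer files.

```python
def pair_accuracy(correct_flags):
    """Percentage of consecutive question pairs with both answers correct.

    correct_flags: list of booleans, one per question, ordered so that
    questions 2k and 2k+1 form a pair (an assumption of this sketch).
    """
    if len(correct_flags) % 2 != 0:
        raise ValueError("Expected an even number of per-question results.")
    pairs = list(zip(correct_flags[0::2], correct_flags[1::2]))
    if not pairs:
        return 0.0
    both_correct = sum(1 for first, second in pairs if first and second)
    return 100.0 * both_correct / len(pairs)
```

Under this convention, a model that answers one question of every pair correctly scores 0% even though half of its individual answers are right.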

* Run the referring segmentation evaluation.
```
conda activate pixmmvp
python ../eval/eval_iou.py --dataset_root DATA_DIR --preds_dir OUT_DIR
```
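As a reminder of what the metric measures, here is a minimal sketch of mean IoU over binary masks. It is not the code in `eval_iou.py`: it uses plain nested lists to stay dependency-free (real code would more likely use NumPy arrays), it averages per-sample IoU, which is one common definition of mIoU, and the handling of two empty masks is a convention chosen for this sketch.

```python
def mask_iou(pred, gt):
    """IoU between two binary masks given as equally sized nested lists of 0/1."""
    intersection = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            intersection += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Convention chosen here: two empty masks count as a perfect match.
    return 1.0 if union == 0 else intersection / union


def mean_iou(pairs):
    """Average IoU over an iterable of (pred, gt) mask pairs."""
    scores = [mask_iou(pred, gt) for pred, gt in pairs]
    return sum(scores) / len(scores) if scores else 0.0
```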

### Vanilla MLLMs
* Select the PROMPT flag according to the probing you want to run.
```
PROMPT: 0 (First probing in the manuscript, protocol1)
PROMPT: 1 (Second probing in the manuscript, protocol3)
PROMPT: 3 (Third probing in the manuscript, protocol2)
```

* Run inference, using LLaVA 1.5 7B as an example.
```
export PYTHONPATH=LLAVA_CODE_DIR/
bash eval_llava.sh DATA_DIR OUT_DIR PROMPT ANS_DIR META_DIR/
```

* Run automatic mask selection using GPT-4o. Make sure to run it in two steps, passing STAGE as 1 and then as 2.
```
bash infer_gptauto.sh llava-1.5-7b-liu DATA_DIR OUT PROMPT META_DIR STAGE
```
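The two stages can be chained from a small driver so that the second never runs if the first fails. The sketch below is an assumption-laden convenience wrapper, not part of the repository: `gptauto_commands` and `run_stages` are hypothetical names, and the argument order is copied from the command above. It defaults to a dry run that only prints the commands.

```python
import subprocess


def gptauto_commands(model, data_dir, out_dir, prompt, meta_dir):
    """Build the two infer_gptauto.sh invocations, STAGE 1 first and STAGE 2 second."""
    return [
        ["bash", "infer_gptauto.sh", model, data_dir, out_dir,
         str(prompt), meta_dir, str(stage)]
        for stage in (1, 2)
    ]


def run_stages(commands, dry_run=True):
    """Print the commands, or execute them in order when dry_run is False."""
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True raises CalledProcessError, so a failed stage stops the loop.
            subprocess.run(cmd, check=True)
```

For example, `run_stages(gptauto_commands("llava-1.5-7b-liu", "DATA_DIR", "OUT", 0, "META_DIR"))` prints both commands without executing them.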

* The remaining steps for evaluating VQA are similar to those for pixel-level MLLMs.

* Choose one of the following for the MASK_SELECTION method
```
MASK_SELECTION : 'auto' (automatic baseline)
MASK_SELECTION : 'oracle' (oracle baseline)
MASK_SELECTION : 'spacy_score' ((a+s) baseline)
```
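To make the selection step concrete, here is a minimal sketch of an oracle-style choice among candidate masks. It assumes that "oracle" means picking the candidate with the highest IoU against the ground-truth mask, which is the usual meaning of an oracle upper bound but is not spelled out in this README. The `auto` and `spacy_score` methods are not sketched. Masks are represented as sets of foreground pixel coordinates purely for brevity, and all names are hypothetical.

```python
def iou(mask_a, mask_b):
    """IoU of two masks represented as sets of (row, col) foreground pixels."""
    union = mask_a | mask_b
    # Convention chosen here: two empty masks count as a perfect match.
    return len(mask_a & mask_b) / len(union) if union else 1.0


def select_oracle(candidates, gt_mask):
    """Return (index, iou) of the candidate that overlaps the ground truth most."""
    if not candidates:
        raise ValueError("No candidate masks to choose from.")
    scores = [iou(mask, gt_mask) for mask in candidates]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]
```

Because this selection reads the ground truth, it gives an upper bound on what any selection method could reach with the same candidates, not a deployable method.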

* Evaluate the referring segmentation, using LLaVA 1.5 7B as an example.
```
bash eval_pixfoundation.sh llava-1.5-7b-liu DATA_DIR OUT_DIR VIS_DIR PROMPT META_DIR MASK_SELECTION
```

* To reproduce the "When" analysis, run the following.
```
bash eval_pixfoundation_when.sh DATA_DIR OUT_DIR ANS_DIR META_DIR VIS_DIR WHEN_DIR
```

## Results
* We provide output samples of the segmentation predictions under (Output/mmvp_output/). For the vanilla MLLMs, the predictions for all output tokens are included, so you still need to run the evaluation with the (a+s), oracle, or automatic baseline to identify the right mask.

* We also provide the output answers (Output/mmvp_answers) and meta files (Output/mmvp_meta) for all the models.

* Finally, we provide the "When" analysis, including both the location and the concept, as numpy files under (Output/mmvp_when) for inspection.

* The table below reports PixMMVP results: Accuracy uses the third probing and mIoU uses the second probing. For the vanilla MLLMs, mIoU is reported using the automatic baseline.

| Model name         | Accuracy | mIoU  |
| ------------------ | -------- | ----- |
| OMG-LLaVA          | 12.0%    | 38.0% |
| GLAMM              | 2.7%     | 47.4% |
| LISA               | 0%       | 42.9% |
| LLaVA-G            | 0%       | 13.5% |
| PixFoundation 7B   | 28.0%    | 25.9% |
| PixFoundation 13B  | 30.0%    | 25.0% |
| PixFoundation 8B   | 52.0%    | 30.3% |


## License
This project uses datasets and checkpoints that are subject to their respective original licenses. Users must comply with all terms and conditions of those licenses, including but not limited to the licenses of the base language models behind the checkpoints used in the evaluation (e.g. the Llama community license for LLaMA-3, and Vicuna-1.5). The MMVP dataset and code follow the MIT license. This project does not impose any additional constraints beyond those stipulated in the original licenses.