# Designing Visual Encoding in MLLMs: A Vision-Centric Analysis

This repository contains the code for the paper "*Designing Visual Encoding in MLLMs: A Vision-Centric Analysis*". The scripts below submit jobs through Slurm. You can either write your own Slurm script or remove `sbatch scripts/1N?G.slurm` from the commands to run them directly.

### Install

Follow the steps below to set up the environment. The environment has only been tested on aarch64; it should work on other architectures after suitably modifying the package sources.

```shell
# conda create -n llava python=3.10 -y
# conda activate llava
pip install -r requirements.txt
pip install -e .
```
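Since the environment has only been tested on aarch64, it can help to confirm the machine architecture before installing. The snippet below is a minimal sketch of such a check; it is not part of the repository, and the messages are illustrative only.

```shell
# Hypothetical pre-install check; not part of the repository.
arch="$(uname -m)"
if [ "$arch" = "aarch64" ]; then
  echo "Detected aarch64: matches the tested platform."
else
  echo "Detected $arch: the package sources may need to be adjusted."
fi
```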

### Preparing Data

Follow the same steps as the original LLaVA to prepare the data needed for training and evaluation. For CV-Bench and PixCV-Bench, run the following command to download and process the data:

```shell
huggingface-cli download --repo-type dataset --resume-download IVUlab/cvbench --local-dir playground/data/eval/cvbench
huggingface-cli download --repo-type dataset --resume-download IVUlab/pixcvbench --local-dir playground/data/eval/pixcvbench
cd playground/data/eval/cvbench
python convert.py
```
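After the downloads finish, a quick check that both target directories exist and are non-empty can catch interrupted transfers early. This is a minimal sketch: `check_dir` is a hypothetical helper, not something the repository provides, and the loop assumes it is run from the repository root.

```shell
# Hypothetical helper; not part of the repository.
# Prints "ok" for a non-empty directory, otherwise reports it and returns 1.
check_dir() {
  if [ -d "$1" ] && [ -n "$(ls -A "$1" 2>/dev/null)" ]; then
    echo "ok: $1"
  else
    echo "missing or empty: $1"
    return 1
  fi
}

# Run from the repository root; the paths are the --local-dir values used above.
for d in playground/data/eval/cvbench playground/data/eval/pixcvbench; do
  check_dir "$d" || true
done
```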

### Generating Region Masks from SAM

`scripts/regions` contains the scripts for generating SAM masks. Run `scripts/regions/gen.sh` to generate all the SAM masks needed for the experiments.

### Start Training

`scripts/run.sh` contains all the experiments included in the paper.

### Evaluation

Run `scripts/eval/run_eval.sh` to evaluate trained models. Additionally, `scripts/eval/pixcv_stats.sh` computes the average number of visual tokens per image and the focus metric using the PixCV-Bench annotations.
