# HieRD: Hierarchical Relational Distillation for Vision-Language Embedding Models

This repository contains the official implementation of **HieRD**, a hierarchical relational distillation framework for vision-language embedding models.

---

## Environment Setup

### Create a Python Virtual Environment

```bash
apt-get update
apt-get upgrade -y
python -m venv vlm
source vlm/bin/activate
```
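Before installing dependencies, it can help to confirm that the `vlm` environment is actually the active interpreter. The snippet below is a minimal sketch of such a check, not part of this repository; the helper names `in_virtualenv` and `describe_environment` are hypothetical.

```python
import sys


def in_virtualenv(prefix: str = sys.prefix, base_prefix: str = sys.base_prefix) -> bool:
    """Return True when the interpreter prefix differs from the base installation.

    Inside an environment created with `python -m venv`, sys.prefix points at the
    environment directory while sys.base_prefix points at the system Python.
    """
    return prefix != base_prefix


def describe_environment() -> str:
    """Summarise the running Python version and whether a venv is active."""
    version = ".".join(str(part) for part in sys.version_info[:3])
    status = "active" if in_virtualenv() else "NOT active"
    return f"Python {version}, virtual environment {status} (prefix: {sys.prefix})"


if __name__ == "__main__":
    print(describe_environment())
```

If the output reports that the environment is not active, re-run `source vlm/bin/activate` in the current shell before continuing.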

### Install Dependencies

```bash
pip install -r requirements.txt
```

### Prepare Data

1. (Optional) Download the evaluation image archive from Hugging Face:

```bash
wget https://huggingface.co/datasets/TIGER-Lab/MMEB-eval/resolve/main/images.zip
unzip images.zip -d eval_images/
```

2. Download the training images. This can take more than an hour:

```bash
bash download_traindata.sh
bash download_traindata_2.sh
```

3. Patch the Transformers library

Because of an error in the **Transformers library** code, a few lines in the Qwen2-VL image processor need to be commented out.

Comment out the following code, lines 140 to 143 in **/vlm/lib/python3.12/site-packages/transformers/models/qwen2_vl/image_processing_qwen2_vl.py** (adjust the prefix to wherever you created the `vlm` environment; the line numbers may shift between Transformers versions):

```python
if size is not None and ("shortest_edge" not in size or "longest_edge" not in size):
    raise ValueError("size must contain 'shortest_edge' and 'longest_edge' keys.")
else:
    size = {"shortest_edge": 56 * 56, "longest_edge": 28 * 28 * 1280}
```

Alternatively, run `fix_lib.py` to apply the fix:

```bash
python fix_lib.py
```
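For readers who want to see what such a patch involves, the following is a minimal sketch of the manual step described above. It is not the repository's `fix_lib.py`, whose contents are not shown here; the helper names (`locate_package_file`, `comment_out`, `patch_file`) and the `.bak` backup are assumptions made for illustration. It locates the file through the import system instead of a hard-coded path, and refuses to edit unless the expected error message appears in the given line range.

```python
import importlib.util
from pathlib import Path

# Assumed values taken from the instructions above; names are hypothetical.
TARGET = "models/qwen2_vl/image_processing_qwen2_vl.py"
GUARD = "size must contain 'shortest_edge' and 'longest_edge' keys."


def locate_package_file(package: str, relative_path: str) -> Path:
    """Resolve a file inside an installed package without hard-coding the venv path."""
    spec = importlib.util.find_spec(package)
    if spec is None or not spec.submodule_search_locations:
        raise FileNotFoundError(f"{package!r} is not importable as a package")
    return Path(list(spec.submodule_search_locations)[0]) / relative_path


def comment_out(text: str, first: int, last: int, must_contain: str) -> str:
    """Comment out 1-indexed lines first..last, keeping each line's indentation.

    Raises ValueError if `must_contain` is absent from the range, so a shifted
    line range in another library version is not patched by mistake.
    """
    lines = text.splitlines(keepends=True)
    if not any(must_contain in line for line in lines[first - 1:last]):
        raise ValueError("expected code not found in the given range; nothing changed")
    for i in range(first - 1, min(last, len(lines))):
        stripped = lines[i].lstrip()
        # Skip blank lines and lines that are already comments (idempotent).
        if stripped.strip() and not stripped.startswith("#"):
            indent = lines[i][: len(lines[i]) - len(stripped)]
            lines[i] = f"{indent}# {stripped}"
    return "".join(lines)


def patch_file(path: Path, first: int = 140, last: int = 143) -> None:
    """Patch the file in place, writing a .bak copy of the original first."""
    original = path.read_text(encoding="utf-8")
    patched = comment_out(original, first, last, GUARD)
    if patched != original:
        path.with_name(path.name + ".bak").write_text(original, encoding="utf-8")
        path.write_text(patched, encoding="utf-8")


# Example (requires transformers to be installed in the active environment):
# patch_file(locate_package_file("transformers", TARGET))
```

Matching on the error message rather than trusting the line numbers alone is a deliberate choice in this sketch: if the range does not contain the expected code, the function raises instead of commenting out unrelated lines.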

## Our Method

The implementation of **HieRD** is located in the following files:

- FastVLM student: `src/criterions/span_propose_attn.py`
- LLaVA-OneVision-0.5B student: `src/criterions/span_propose_attn_llava_ov.py`

---

## Training

Example training scripts for the FastVLM 0.5B student model are provided in the `scripts` folder.

### For CLS Benchmark

To train the FastVLM 0.5B student model using HieRD on the CLS benchmark:

```bash
bash scripts/train_distill_span_weighted_cls.sh
```

### For VQA Benchmark

To train the FastVLM 0.5B student model using HieRD on the VQA benchmark:

```bash
bash scripts/train_distill_span_weighted_vqa.sh
```

---

## Inference & Evaluation

To evaluate the trained model on an MMEB dataset:

1. Update the following parameters in `eval.sh`:
   - `model_name`: Path or identifier of your trained model
   - `encode_output_path`: Directory where evaluation outputs will be saved
   - `model_backbone`: Backbone architecture of your model
   - `subset_name`: Target dataset for evaluation (e.g., MSCOCO_i2t, ImageNet_1K)

2. Run the evaluation script:

```bash
bash eval.sh
```

---

## Acknowledgement

We thank the authors of the following repositories, whose code inspired this work:

- [VLM2Vec](https://github.com/TIGER-AI-Lab/VLM2Vec)
- [B3](https://github.com/raghavlite/B3)
