# E5-V: Universal Embeddings with Multimodal Large Language Models

## Overview
We propose E5-V, a framework that adapts multimodal large language models (MLLMs) to produce multimodal embeddings. E5-V effectively bridges the modality gap between different types of inputs and performs strongly on multimodal embedding tasks even without fine-tuning. We also propose a single-modality training approach for E5-V, in which the model is trained exclusively on text pairs; this performs better than multimodal training.

![](figure/e5v.png)


## Training
1. Install Dependencies

``` sh
pip install -r requirements.txt
```

2. Download Data

``` sh
cd ./data
bash download_nli.sh
cd -
```

3. Convert the llava-llama-3-8b model to Hugging Face format on each node

``` sh
mkdir -p models
cd models
for i in 1 2 3 4; do
    wget https://huggingface.co/lmms-lab/llama3-llava-next-8b/resolve/main/model-0000$i-of-00004.safetensors
done
cd -
python load_llama3_hf.py
rm models/*.safetensors
```
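
Before running the conversion in step 3, it can help to confirm that all four shards are present, since an interrupted `wget` may leave a file missing or empty. The following is a minimal sketch and not part of the repository: `check_shards` is a hypothetical helper name, and it assumes the shards sit in `models/` under the file names used above. It only checks for missing or empty files; it does not detect a truncated download.

```sh
# Hypothetical helper (not shipped with this repo): verify that all four
# safetensors shards exist and are non-empty in the given directory.
check_shards() {
    dir="${1:-models}"   # default to the models/ directory used above
    missing=0
    for i in 1 2 3 4; do
        f="$dir/model-0000$i-of-00004.safetensors"
        # -s is true only if the file exists and has a size greater than zero
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

One possible use is to gate the conversion on the check, for example `check_shards models && python load_llama3_hf.py`, so the script runs only when every shard is in place.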

4. Train

``` sh
bash run.sh
```

5. Test

Pass the LoRA weights you want to evaluate with the `--lora_path` flag.

``` sh
accelerate launch --num_machines=1 --num_processes 8 --machine_rank 0 retrieval.py \
    --llava_llama3 --lora_path e5v-8b  --batch_size 1
```


## Acknowledgement
Our code is based on SimCSE and alpaca-lora.
