# IDentify

### **Unified Multi-Modal Interleaved Document Representation for Information Retrieval**

# Scripts

## Retriever Train
```bash
bash scripts/retrieval/encyclopedic_vqa/train_retriever.sh
```

## Reranker Train
```bash
bash scripts/retrieval/encyclopedic_vqa/train_reranker.sh
```

## Multimodal Query Embedding Extraction
```bash
bash scripts/retrieval/encyclopedic_vqa/extract_embeds/embed_multimodal_query.sh checkpoint_name 
```

## Document Embedding Extraction
```bash
bash scripts/retrieval/encyclopedic_vqa/extract_embeds/embed_interleaved_doc.sh checkpoint_name 
```

## Document Retrieval
```bash
bash scripts/retrieval/encyclopedic_vqa/info_retrieval/document/query_doc_retrieval.sh checkpoint_name 
```
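
As a rough illustration of what this step computes, the sketch below ranks documents by the inner product between a query embedding and each document embedding. The function name `retrieve_top_k`, the dictionary layout, and the choice of inner-product similarity are assumptions made for illustration; they are not taken from the repository's code.

```python
def retrieve_top_k(query_emb, doc_embs, k=5):
    """Return (doc_id, score) pairs for the k highest inner-product scores.

    query_emb: list of floats (hypothetical query embedding).
    doc_embs:  dict mapping doc_id -> list of floats (hypothetical layout).
    """
    scores = {
        doc_id: sum(q * d for q, d in zip(query_emb, emb))
        for doc_id, emb in doc_embs.items()
    }
    # Sort by score, highest first, and keep the first k entries.
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Toy usage with 2-dimensional embeddings.
doc_embs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.7, 0.7]}
top_docs = retrieve_top_k([1.0, 0.2], doc_embs, k=2)
```

For a large corpus, an approximate nearest-neighbor index would normally replace the exhaustive scan shown here.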

## Section Retrieval
```bash
bash scripts/retrieval/encyclopedic_vqa/info_retrieval/section/multimodal_query_interleaved_sec_rerank.sh checkpoint_name 
```
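
The script name suggests a second stage in which sections of the retrieved documents are reranked against the query. A minimal sketch of that idea follows, under stated assumptions: `rerank_sections` and `word_overlap` are hypothetical names, the query is simplified to text only, and the toy word-overlap scorer merely stands in for the trained reranker.

```python
from typing import Callable, Dict, List, Tuple


def rerank_sections(
    query: str,
    retrieved_docs: Dict[str, List[str]],
    score_fn: Callable[[str, str], float],
    top_n: int = 3,
) -> List[Tuple[str, int, float]]:
    """Score every section of the retrieved documents and keep the top_n."""
    candidates = []
    for doc_id, sections in retrieved_docs.items():
        for section_idx, section in enumerate(sections):
            candidates.append((doc_id, section_idx, score_fn(query, section)))
    # Highest-scoring sections first.
    candidates.sort(key=lambda c: c[2], reverse=True)
    return candidates[:top_n]


def word_overlap(query: str, section: str) -> float:
    """Toy scorer (assumption): number of words shared by query and section."""
    return float(len(set(query.lower().split()) & set(section.lower().split())))


docs = {
    "d1": ["history of paris", "the eiffel tower height is 330 m"],
    "d2": ["tower bridge london"],
}
best = rerank_sections("eiffel tower height", docs, word_overlap, top_n=2)
```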

# Installation

### 1. **Clone this repository and navigate to the Multimodal-Retrieval folder:**
```bash
git clone https://github.com/LLaVA-VL/LLaVA-NeXT Multimodal-Retrieval
cd Multimodal-Retrieval
```

### 2. **Install the inference package:**
```bash
conda create -n llava_next python=3.10 -y
conda activate llava_next
pip install --upgrade pip  # Enable PEP 660 support.
pip install -e ".[train]"
```


# Base model

The base model is downloaded from Hugging Face ([LLaVA-Next-Interleave 0.5B](https://huggingface.co/lmms-lab/llava-next-interleave-qwen-0.5b)). <br />
The base model should be small so that training can use a large batch size, which makes contrastive learning more effective.
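
To make the batch-size argument concrete, here is a minimal sketch of a contrastive (InfoNCE-style) loss with in-batch negatives. It assumes that each query is paired with one positive document and that every other document in the batch acts as a negative, so a batch of size B gives each query B - 1 negatives. The function name, the temperature value, and the use of cosine similarity are illustrative assumptions; the actual training objective in this repository may differ.

```python
import math


def info_nce_loss(query_embs, doc_embs, temperature=0.05):
    """Average InfoNCE loss with in-batch negatives (illustrative sketch).

    query_embs[i] and doc_embs[i] form a positive pair; every other
    document in the batch serves as a negative for query i.
    """
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    queries = [normalize(q) for q in query_embs]
    docs = [normalize(d) for d in doc_embs]

    total = 0.0
    for i, q in enumerate(queries):
        # Cosine similarity of query i against every document in the batch.
        logits = [sum(a * b for a, b in zip(q, d)) / temperature for d in docs]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(queries)


queries = [[1.0, 0.0], [0.0, 1.0]]
matched = info_nce_loss(queries, [[0.9, 0.1], [0.1, 0.9]])
swapped = info_nce_loss(queries, [[0.1, 0.9], [0.9, 0.1]])
```

The loss is lower when each query is closest to its own document (`matched`) than when the pairs are swapped, and a larger batch adds more negatives to each denominator.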

# Datasets

## [Encyclopedic-VQA](https://github.com/google-research/google-research/tree/master/encyclopedic_vqa)

A dataset that maps multimodal queries (image + text) to interleaved Wikipedia documents, focused on landmarks and species. <br />
The paper provides interleaved Wikipedia documents, but they require preprocessing for our experiments. <br />
Refer to the `./utils/encyclopedic_vqa/preprocess_order` document for the order in which to run the preprocessing code.

## [InfoSeek](https://github.com/open-vision-language/infoseek)

A dataset that maps multimodal queries (image + text) to interleaved Wikipedia documents, covering a much more diverse set of visual entities. <br />
The paper provides only a fraction of the text from the Wikipedia documents as its knowledge base (KB). Hence, we need to download the images and process
the texts to build interleaved documents. <br />
The KB processing will be implemented soon.

## [ViQuAE](https://github.com/PaulLerner/ViQuAE?tab=readme-ov-file)

A dataset that maps queries (image + text) to interleaved Wikipedia documents, focused on human entities. <br />
Similar to Encyclopedic-VQA, it provides text-only queries corresponding to the image + text queries, <br />
as well as evidence sections (provenance). However, 

## [WikiTableQuestions](https://github.com/ppasupat/WikiTableQuestions?tab=readme-ov-file)

A dataset that maps text queries to tables from Wikipedia documents. It provides the source Wikipedia URLs as
auxiliary metadata, so we need to modify the dataset to fit our experimental setting (interleaved documents with HTML table data).
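
One way to picture an interleaved document with HTML table data is as an ordered list of typed segments. The sketch below is hypothetical: the `Segment` and `InterleavedDocument` classes, their field names, and the `rows_to_html` helper are all assumptions for illustration and do not describe the repository's actual data format.

```python
from dataclasses import dataclass, field
from html import escape
from typing import List


@dataclass
class Segment:
    kind: str      # "text", "image", or "table" (hypothetical labels)
    content: str   # raw text, an image path, or an HTML table string


@dataclass
class InterleavedDocument:
    url: str  # source Wikipedia URL
    segments: List[Segment] = field(default_factory=list)


def rows_to_html(header, rows):
    """Serialize a header and data rows into a minimal HTML table string."""
    head = "".join(f"<th>{escape(str(h))}</th>" for h in header)
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(str(c))}</td>" for c in row) + "</tr>"
        for row in rows
    )
    return f"<table><tr>{head}</tr>{body}</table>"


table_html = rows_to_html(["Year", "City"], [[2008, "Beijing"]])
doc = InterleavedDocument(
    url="https://en.wikipedia.org/wiki/Summer_Olympic_Games",
    segments=[
        Segment("text", "Host cities of the Summer Olympics."),
        Segment("table", table_html),
    ],
)
```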

## [Open-WikiTable](https://github.com/sean0042/Open_WikiTable)

A dataset that maps text queries to tables from Wikipedia documents. Unlike WikiTableQuestions, this dataset targets open-domain QA
and hence fits our retrieval experiments. It is based on WikiTableQuestions (and WikiSQL), so we can
use the Wikipedia URLs provided by WikiTableQuestions to build interleaved documents for Open-WikiTable.

Expected challenges of tables in interleaved documents:
1. Finding the correct document.
2. Tables from the same document may have closely related contents that are hard to distinguish.