# MI-RAG: Multimodal Iterative Retrieval-Augmented Generation for Knowledge VQA

This repository contains the implementation for the paper "Multimodal Iterative Retrieval-Augmented Generation for Knowledge VQA" submitted to ICLR 2026.

## Overview

MI-RAG is a multimodal retrieval-augmented generation (RAG) system for knowledge-based visual question answering (VQA). It combines visual and textual information through iterative retrieval and generation.

## Setup and Installation

### Prerequisites

- Python 3.8+
- CUDA-compatible GPU
- Required Python packages:
  ```bash
  pip install torch torchvision open_clip_torch sentence-transformers faiss-cpu pandas numpy pillow transformers
  ```
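
Before building the index, it can help to confirm that the packages above are importable. The following is a minimal sketch, not part of the repository; the import names are the usual ones for these pip packages (for example `pillow` is imported as `PIL` and `open_clip_torch` as `open_clip`), and the helper name `find_missing` is hypothetical.

```python
import importlib.util

# Import names for the pip packages listed above. The import name can
# differ from the pip name (pillow -> PIL, faiss-cpu -> faiss).
REQUIRED_MODULES = [
    "torch", "torchvision", "open_clip", "sentence_transformers",
    "faiss", "pandas", "numpy", "PIL", "transformers",
]


def find_missing(modules):
    """Return the module names that cannot be located in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED_MODULES)
    print("Missing modules:", missing or "none")
```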

### Data Preparation

#### 1. Download InfoSeek Dataset and Wikipedia Data

Download the required datasets from the InfoSeek repository:

```bash
# Wikipedia data (6.9GB)
wget http://storage.googleapis.com/gresearch/open-vision-language/Wiki6M_ver_1_0.jsonl.gz
gunzip Wiki6M_ver_1_0.jsonl.gz

# Annotations are already provided in the jsonl/ folder.
# Update image_path in each file to match where you stored the images
# for each dataset (Encyclopedic VQA, InfoSeek, OK-VQA).
```

#### 2. Download Wikipedia Images

Follow the image download instructions from the [OVEN evaluation repository](https://github.com/edchengg/oven_eval/tree/main/image_downloads) to download Wikipedia images using the `wikipedia_image_url` field in the dataset.

#### 3. Update Data Paths

Modify the paths in `create_index.py`:
- Set `JSONL_FILE` to your `Wiki6M_ver_1_0.jsonl` path
- Set `OUTPUT_DIR` to your desired index storage location
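
To check that the `JSONL_FILE` path is correct before starting a long indexing run, you can peek at the first record. This is a small sketch under the assumption that the file is standard JSON Lines (one JSON object per line); the helper `peek_jsonl` is hypothetical and not part of `create_index.py`.

```python
import json


def peek_jsonl(path, n=1):
    """Return the first n records of a JSONL file as dictionaries."""
    records = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            records.append(json.loads(line))
            if len(records) >= n:
                break
    return records
```

For example, `peek_jsonl("Wiki6M_ver_1_0.jsonl")[0].keys()` should list the record fields, including the `wikipedia_image_url` field mentioned above.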

## Evaluation Pipeline

### Step 1: Build Retrieval Index

Generate the mixed-modal retrieval index from Wikipedia data:

```bash
python create_index.py
```

This will:
- Process Wikipedia images and text in chunks
- Generate SigLIP image embeddings and GTE text embeddings
- Create a FAISS index named `mixed_index_large.index`
- Generate metadata CSV files
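
Processing in chunks keeps memory usage bounded while embedding a large corpus. The sketch below illustrates the general pattern only; it is not taken from `create_index.py`, and the helper name `iter_chunks` and the chunk size are assumptions.

```python
from itertools import islice


def iter_chunks(iterable, chunk_size):
    """Yield successive lists of at most chunk_size items from iterable."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


if __name__ == "__main__":
    # Each chunk would be embedded and added to the index before the
    # next one is read, so only one chunk is held in memory at a time.
    for chunk in iter_chunks(range(5), 2):
        print(chunk)
```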

### Step 2: Validate Retrieval Performance

Before running the full MI-RAG pipeline, check the retrieval quality of the index:

```bash
python evaluate_retrieval.py --data jsonl/infoseek_val_5k.jsonl --topk 10
```
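
A common way to summarize retrieval quality is recall@k: the fraction of questions for which at least one correct entity appears among the top-k results. The sketch below assumes this metric for illustration; the exact metric reported by `evaluate_retrieval.py` may differ, and the function name and the entity IDs are hypothetical.

```python
def recall_at_k(retrieved_ids, gold_ids, k):
    """Fraction of queries whose top-k retrieved IDs contain a gold ID.

    retrieved_ids: list of ranked ID lists, one per query.
    gold_ids: list of acceptable ID lists, one per query.
    """
    if not retrieved_ids:
        return 0.0
    hits = 0
    for ranked, gold in zip(retrieved_ids, gold_ids):
        if set(ranked[:k]) & set(gold):
            hits += 1
    return hits / len(retrieved_ids)


if __name__ == "__main__":
    retrieved = [["Q1", "Q2", "Q3"], ["Q9", "Q8", "Q7"]]
    gold = [["Q2"], ["Q7"]]
    print(recall_at_k(retrieved, gold, k=2))
```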

### Step 3: Run MI-RAG Evaluation

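Run the main MI-RAG pipeline with the desired model and configuration. The `--data` paths and the `CUDA_VISIBLE_DEVICES` value in the examples below are specific to the original development environment; replace them with your own. Note that the example paths point to an Encyclopedic VQA file; for InfoSeek, pass an InfoSeek annotation file such as `jsonl/infoseek_val_5k.jsonl`.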

#### InfoSeek Dataset Evaluation

```bash
# Gemma model evaluation
CUDA_VISIBLE_DEVICES=3 python main.py \
    --data /drl_nas2/ckddls1321/data/Encyclopedic_VQA/test_1k.jsonl \
    --model openrouter:google/gemma-3-4b-it \
    --verbose \
    --itercount 4 \
    --image-search \
    --ask \
    --topk 10

# Gemini model evaluation
python main.py \
    --data /drl_nas2/ckddls1321/data/Encyclopedic_VQA/test_1k.jsonl \
    --model openrouter:google/gemini-2.5-flash \
    --verbose \
    --itercount 4 \
    --image-search \
    --ask \
    --topk 20
```
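
Conceptually, `--itercount`, `--ask`, and `--topk` control a loop that alternates retrieval and generation. The following is a simplified sketch of such a loop, not the implementation in `main.py`: the callables `retrieve` and `generate`, their signatures, and the stopping rule are all assumptions made for illustration.

```python
def iterative_answer(question, retrieve, generate, itercount=4, topk=10):
    """Alternate retrieval and generation for up to itercount rounds.

    retrieve(query, topk) -> list of passages (hypothetical signature)
    generate(question, evidence) -> (answer, follow_up or None) (hypothetical)
    """
    evidence = []
    # Initial attempt without retrieved evidence.
    answer, follow_up = generate(question, evidence)
    for _ in range(itercount):
        # Use the follow-up question as the next query when one exists.
        query = follow_up or question
        evidence.extend(retrieve(query, topk))
        answer, follow_up = generate(question, evidence)
        if follow_up is None:
            break  # the model asked nothing further, so stop early
    return answer, evidence
```

With `itercount=0` this sketch reduces to a single generation call with no retrieval.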

### Step 4: Evaluate Results

#### InfoSeek Dataset (CEM Evaluation)

```bash
python direct_eval.py \
    --pkl outputs/gemini-2.5-flash_InfoSeek_infoseek_val_5k_da_False_iter_4_ask3_image_search.pkl
```

#### Encyclopedic VQA Dataset (BEM Evaluation)

```bash
python direct_eval.py \
    --pkl outputs/gemini-2.5-flash_Encyclopedic_VQA_test_5k_da_False_iter_4_ask3_image_search.pkl \
    --bem
```

## Key Arguments

### main.py Arguments

- `--data`: Path to input JSONL file
- `--model`: Model name (supports OpenAI, Gemma, Gemini, etc.)
- `--itercount`: Number of iterative reasoning steps (default: 0)
- `--topk`: Number of passages to retrieve (default: 10)
- `--image-search`: Enable image-based retrieval
- `--ask`: Enable iterative follow-up questions
- `--verbose`: Enable verbose output
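
For reference, the flags above map onto a standard `argparse` definition along these lines. This is a sketch that mirrors the documented names and defaults only; the actual parser in `main.py` may define additional options or different help text.

```python
import argparse


def build_parser():
    """Build a parser mirroring the main.py options documented above."""
    parser = argparse.ArgumentParser(description="Illustrative mirror of main.py options")
    parser.add_argument("--data", help="Path to input JSONL file")
    parser.add_argument("--model", help="Model name")
    parser.add_argument("--itercount", type=int, default=0,
                        help="Number of iterative reasoning steps")
    parser.add_argument("--topk", type=int, default=10,
                        help="Number of passages to retrieve")
    parser.add_argument("--image-search", action="store_true",
                        help="Enable image-based retrieval")
    parser.add_argument("--ask", action="store_true",
                        help="Enable iterative follow-up questions")
    parser.add_argument("--verbose", action="store_true",
                        help="Enable verbose output")
    return parser
```

Note that `argparse` exposes `--image-search` as `args.image_search`.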

### direct_eval.py Arguments

- `--csv`: Path to CSV file with model outputs
- `--pkl`: Path to pickle file with model outputs
- `--bem`: Use BEM evaluation (for Encyclopedic VQA)

## Output Files

The evaluation generates several output files:
- Pickle files in `outputs/` directory containing model predictions
- CSV files with detailed results
- Metadata files for retrieval analysis
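
To inspect a prediction file by hand, it can be loaded with the standard `pickle` module. The sketch below makes no assumption about the structure of the stored object and only reports its type and size; the helper names are hypothetical. Only unpickle files you trust, since loading a pickle can execute arbitrary code.

```python
import pickle


def load_predictions(path):
    """Load a pickled predictions object from disk (trusted files only)."""
    with open(path, "rb") as f:
        return pickle.load(f)


def summarize(obj):
    """Describe the loaded object without assuming its structure."""
    size = len(obj) if hasattr(obj, "__len__") else "n/a"
    return f"{type(obj).__name__} with {size} entries"
```

If the pickle stores a library-specific object such as a pandas DataFrame, that library must be installed for loading to succeed.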

## Model Support

The framework supports multiple models through API wrappers:
- OpenAI models (GPT-3.5, GPT-4)
- Google models (Gemini, Gemma)
- Other models (Qwen, InternVL, Ovis)
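
The example commands pass model names such as `openrouter:google/gemma-3-4b-it`, which read as a provider prefix followed by a model identifier. The sketch below shows one way such a string could be split; how `main.py` actually interprets the prefix is an assumption, and `split_model_spec` is a hypothetical helper.

```python
def split_model_spec(spec, default_provider=None):
    """Split 'provider:model' into (provider, model).

    Strings without a colon are returned with default_provider.
    """
    provider, sep, name = spec.partition(":")
    if not sep:
        return default_provider, spec
    return provider, name


if __name__ == "__main__":
    print(split_model_spec("openrouter:google/gemma-3-4b-it"))
```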

