# MultiHaystack: Benchmarking Multimodal Reasoning Over 40K Images, Videos, and Documents

This repository contains the implementation of MultiHaystack, a benchmark for evaluating multimodal retrieval and visual question answering (VQA) systems across images, videos, and documents.

## Repository Structure

```
├── retrieval/           # Multimodal retrieval models
├── VQA/                 # Visual question answering models
├── dataset/             # Dataset directory
└── Multihaystack.json   # Benchmark dataset
```

## Configuration

Each model script has a configuration section at the top of the file. Adjust the following settings there:

- **Dataset paths**: Modify `DATASET_PATH` and `INPUT_JSON`
- **Model parameters**: Adjust generation settings, batch sizes, etc.
- **Output settings**: Configure result file names and formats
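
As an illustration, a configuration section of this kind might look like the sketch below. Only `DATASET_PATH` and `INPUT_JSON` are named in this README; the other names (`BATCH_SIZE`, `MAX_NEW_TOKENS`, `OUTPUT_JSON`) and all values are hypothetical placeholders, so check the top of each script for the settings it actually exposes.

```python
from pathlib import Path

# Dataset paths (names from this README; values are placeholders)
DATASET_PATH = Path("dataset")
INPUT_JSON = Path("Multihaystack.json")

# Model parameters (hypothetical names and values)
BATCH_SIZE = 8
MAX_NEW_TOKENS = 128

# Output settings (hypothetical name and value)
OUTPUT_JSON = Path("results.json")
```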

## Dataset Format

The benchmark is stored as JSON. Each entry pairs a question and its reference answer (`conversations`) with the files relevant to that question (`positive`):

```json
{
  "conversations": [
    {"from": "human", "value": "question"},
    {"from": "gpt", "value": "answer"}
  ],
  "positive": ["relevant_file1.jpg", "relevant_file2.mp4"]
}
```
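
A minimal sketch of reading one such entry in Python follows. The helper `parse_entry` is hypothetical and not part of this repository, and it assumes each entry has exactly one `human` turn and one `gpt` turn, as in the example above.

```python
import json


def parse_entry(entry: dict) -> tuple[str, str, list[str]]:
    """Return (question, answer, positive_files) for one benchmark entry."""
    # Map speaker -> text; assumes one "human" and one "gpt" turn per entry.
    turns = {turn["from"]: turn["value"] for turn in entry["conversations"]}
    return turns["human"], turns["gpt"], list(entry["positive"])


# The sample entry from the format description above.
sample = json.loads("""
{
  "conversations": [
    {"from": "human", "value": "question"},
    {"from": "gpt", "value": "answer"}
  ],
  "positive": ["relevant_file1.jpg", "relevant_file2.mp4"]
}
""")

question, answer, positives = parse_entry(sample)
```

Here `question` and `answer` hold the two conversation turns, and `positives` holds the file names a retrieval model is expected to return for that question.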