---
language:
- en
license: cc-by-4.0
size_categories:
- 1K<n<10K
task_categories:
- question-answering
- visual-question-answering
- multiple-choice
pretty_name: MRAG-Bench
dataset_info:
  features:
  - name: id
    dtype: string
  - name: aspect
    dtype: string
  - name: task
    dtype: string
  - name: image
    dtype: image
  - name: gt_images
    sequence: image
  - name: question
    dtype: string
  - name: choices
    sequence: string
  - name: answer_choice
    dtype: string
  - name: answer
    dtype: string
  - name: image_type
    dtype: string
  - name: label
    dtype: string
  - name: source
    dtype: string

configs:
- config_name: default
  data_files:
  - split: test
    path: data/test-*
---


# MRAG-Benchmark

## This is only a sample of our benchmark, provided to satisfy the 100 MB limit on supplementary material.

# README: Loading and Visualizing Image Examples from Parquet Files

## Overview

This project contains Parquet files split into chunks, with each file containing data about images, questions, and answers. 

## File Structure

The Parquet files are saved in chunks, each named in the following format:
`test-0000{i}-of-000014.parquet`. This sample provides only `test-000013-of-000014`.


The Parquet files are saved in the `data/` directory.

Each record contains fields such as:
- `id`: Unique identifier for the example.
- `aspect`: Aspect type for the example (e.g., 'Perspective').
- `task`: Type of task (e.g., 'Angle').
- `question`: Question to be answered (e.g., 'What is the species of this animal?').
- `choices`: List of possible answers.
- `answer_choice`: Correct choice identifier (e.g., 'A').
- `answer`: Correct answer (e.g., 'basenji').
- `gt_images`: List of ground-truth images (each stored as bytes).
- `image_type`: Type of image (e.g., 'Animal').
- `label`: The label of the image (e.g., 'basenji').
- `source`: Source of the image (e.g., 'Imagenet').
- `image`: Contains image data in byte format.
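
As a rough illustration of how `question`, `choices`, `answer_choice`, and `answer` fit together, the sketch below renders one record as a lettered multiple-choice prompt. It assumes that the choices map to the letters A, B, C, ... in list order, which this README does not state explicitly. The helper names `format_question` and `answer_text` are hypothetical, and every choice in the example record other than 'basenji' is made up for illustration.

```python
from string import ascii_uppercase


def format_question(question, choices):
    """Render a question and its choices as a lettered prompt.

    Assumption: choices correspond to letters A, B, C, ... in list order.
    """
    lines = [question]
    for letter, choice in zip(ascii_uppercase, choices):
        lines.append(f"{letter}. {choice}")
    return "\n".join(lines)


def answer_text(choices, answer_choice):
    """Look up the choice text for a letter such as 'A' (same ordering assumption)."""
    index = ascii_uppercase.index(answer_choice.strip().upper())
    return choices[index]


# Hypothetical record; only 'basenji' and the question come from the field list above.
example = {
    "question": "What is the species of this animal?",
    "choices": ["basenji", "beagle", "whippet", "dingo"],
    "answer_choice": "A",
    "answer": "basenji",
}

prompt = format_question(example["question"], example["choices"])
print(prompt)
```

If the ordering assumption holds for a given record, `answer_text(row['choices'], row['answer_choice'])` should equal `row['answer']`, which makes for a quick sanity check on the loaded data.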

## Requirements

To load the Parquet files and visualize the image data, you'll need the following libraries:

```bash
pip install pandas pyarrow matplotlib pillow
```

## Loading and Processing the Data
### Step 1: Loading the Parquet Files

To load all the Parquet files and combine them into a single DataFrame, you can use the following code:

```python
import pandas as pd
import os

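# Path to the directory containing the Parquet files
# (assumes a data/ folder inside the current working directory; adjust as needed)
parquet_dir = os.path.join(os.getcwd(), 'data')

# List all Parquet files in the directory, sorted so chunks load in a stable order
parquet_files = sorted(
    os.path.join(parquet_dir, f) for f in os.listdir(parquet_dir) if f.endswith('.parquet')
)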

# Load all Parquet files into a single DataFrame
df = pd.concat([pd.read_parquet(f) for f in parquet_files], ignore_index=True)

print(f"Loaded {len(df)} rows from {len(parquet_files)} Parquet files.")
```

### Step 2: Displaying and Plotting an Image

Each image is stored as bytes in the `image` field of the dataset, under the `bytes` key. To visualize an image, convert the bytes into an image with the Pillow library and then plot it with matplotlib.

Here's an example of how to display a random image from the dataset:

```python
from PIL import Image
import io
import matplotlib.pyplot as plt

# Select a random row
row = df.sample(1).iloc[0]

# Extract the image bytes from the selected row
image_bytes = row['image']['bytes']

# Convert the bytes into an image
image = Image.open(io.BytesIO(image_bytes))

# Plot the image along with its label and question
plt.imshow(image)
plt.title(f"Label: {row['label']}\nQuestion: {row['question']}")
plt.axis('off')  # Hide axis
plt.show()
```
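
The same decoding step can be applied to the `gt_images` field. The sketch below collects the raw bytes for the query image and then for each entry in `gt_images`. It assumes that every `gt_images` entry is a dict with a `bytes` key, like the `image` field; that layout is inferred from the schema rather than stated here, and `iter_image_bytes` is a hypothetical helper name.

```python
def iter_image_bytes(row):
    """Yield raw bytes for the query image, then for each entry in gt_images.

    Assumption: each gt_images entry is a dict with a 'bytes' key,
    mirroring the layout of the 'image' field.
    """
    yield row["image"]["bytes"]
    gt_images = row["gt_images"]
    # Compare against None explicitly: pandas may hand back a NumPy array here,
    # and the truth value of an array is ambiguous.
    if gt_images is not None:
        for entry in gt_images:
            yield entry["bytes"]


# Stand-in row with placeholder bytes; a row taken from the DataFrame works the same way.
fake_row = {
    "image": {"bytes": b"query"},
    "gt_images": [{"bytes": b"gt-0"}, {"bytes": b"gt-1"}],
}
all_bytes = list(iter_image_bytes(fake_row))
print(len(all_bytes))
```

Each yielded value can be passed to `Image.open(io.BytesIO(...))` exactly as in the example above.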

### Step 3: Plotting Multiple Images

If you want to plot multiple images at once, you can modify the above code to select multiple rows and plot them in a grid layout:

```python
import io

import matplotlib.pyplot as plt
from PIL import Image

# Number of images to display
num_images = 4

# Select random rows
sample_rows = df.sample(num_images)

# Plot the images in a grid
fig, axes = plt.subplots(1, num_images, figsize=(15, 5))

for i, row in enumerate(sample_rows.itertuples()):
    # Extract the image bytes
    image_bytes = row.image['bytes']
    
    # Convert the bytes to an image
    image = Image.open(io.BytesIO(image_bytes))
    
    # Plot the image
    axes[i].imshow(image)
    axes[i].set_title(f"Label: {row.label}")
    axes[i].axis('off')  # Hide axis

plt.show()
```
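
The single-row layout above gets crowded as `num_images` grows. One possible way to wrap the images onto several rows is to compute the grid shape first, as in this minimal sketch. `grid_shape` and its `max_cols` parameter are hypothetical names introduced for illustration.

```python
import math


def grid_shape(num_images, max_cols=4):
    """Return (rows, cols) for a grid that fits num_images with at most max_cols columns."""
    if num_images < 1 or max_cols < 1:
        raise ValueError("num_images and max_cols must both be at least 1")
    cols = min(num_images, max_cols)
    rows = math.ceil(num_images / cols)
    return rows, cols


rows, cols = grid_shape(10)
print(rows, cols)
```

The result can be passed to `plt.subplots(rows, cols, squeeze=False)`, which always returns a 2-D array of axes; calling `axes.ravel()` on it gives a flat sequence that can be indexed as `axes[i]` in the loop above. Any unused axes in the last row can be hidden with `axis('off')`.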