# TDBench Evaluation Results - Supplemental Materials

## Overview

This repository contains aggregated evaluation results from experiments on the TDBench benchmark. Because the raw experimental outputs are large, we provide only aggregated results, which combine evaluations across the 9 main categories and the visual grounding tasks.

## Directory Structure

- `aggregated_results/` - Aggregated experimental results, combining the 9 main categories and the visual grounding tasks
- `calculate_metrics.py` - Script to generate the comprehensive evaluation table
- `aggregate_model_metrics.py` - Script to compute model-wise averages across categories

## Scripts Description

### calculate_metrics.py
Produces the full evaluation table containing:
- **60 models** evaluated across **10 categories** (the 9 main categories plus visual grounding)
- Complete set of evaluation metrics:
  - **RE** (RotationalEval) - Consistency across all rotations
  - **VE mean** (VanillaEval mean) - Average accuracy across rotations
  - **Miss all** - Failure rate across all rotations
  - **ACR** - Average Consistency Rate
  - **θ** (Theta) - Knowledge Coverage
  - **r** - Knowledge Reliability
  - **g** - Guess Success Rate
  - **KII** - Knowledge Integrity Index

These metrics correspond to Tables 4-7 in the appendix of the paper.
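
As a rough illustration of the three rotation-level metrics, the sketch below computes RE, VE mean, and Miss all from per-question correctness flags. It follows only the one-line descriptions above, not the exact formulas in the paper or in `calculate_metrics.py`. The `Results` layout, the `rotation_metrics` function, and the sample values are hypothetical. The remaining metrics (ACR, θ, r, g, KII) are omitted because their definitions are not given here.

```python
from typing import Dict, List

# Hypothetical layout (an assumption, not the format used in aggregated_results/):
# question id -> correctness flag for each rotation (0°, 90°, 180°, 270°).
Results = Dict[str, List[bool]]


def rotation_metrics(results: Results) -> Dict[str, float]:
    """Compute RE, VE mean, and Miss all as read from the descriptions above."""
    n = len(results)
    if n == 0:
        raise ValueError("no questions to score")

    # RE: fraction of questions answered correctly under every rotation.
    re_score = sum(all(flags) for flags in results.values()) / n

    # VE mean: per-question accuracy over rotations, averaged over questions.
    ve_mean = sum(sum(flags) / len(flags) for flags in results.values()) / n

    # Miss all: fraction of questions answered incorrectly under every rotation.
    miss_all = sum(not any(flags) for flags in results.values()) / n

    return {"RE": re_score, "VE mean": ve_mean, "Miss all": miss_all}


# Made-up sample data for illustration only; these are not TDBench results.
example: Results = {
    "q1": [True, True, True, True],
    "q2": [True, False, True, False],
    "q3": [False, False, False, False],
    "q4": [True, True, True, False],
}

metrics = rotation_metrics(example)
print(metrics)
```

Under this reading, RE can never exceed VE mean, since a question counts toward RE only when it is correct at every rotation.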

### aggregate_model_metrics.py
Computes model-wise performance by:
- Averaging results across **10 categories**
- Generating the summary statistics shown in **Table 2** of the main paper (which displays 8 categories)
- Providing a comprehensive view of each model's overall performance
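
The averaging step can be sketched as follows. This is a minimal example that assumes an unweighted mean over categories; the actual script may weight or filter categories differently. The row layout, the `model_averages` function, and the model and category names are hypothetical placeholders.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple

# Hypothetical row layout (an assumption): (model, category, metric value).
Row = Tuple[str, str, float]


def model_averages(rows: List[Row]) -> Dict[str, float]:
    """Average one metric over all categories for each model (unweighted)."""
    per_model: Dict[str, List[float]] = defaultdict(list)
    for model, _category, value in rows:
        per_model[model].append(value)
    return {model: mean(values) for model, values in per_model.items()}


# Placeholder values for illustration only; these are not TDBench results.
rows: List[Row] = [
    ("model_a", "category_1", 0.50),
    ("model_a", "category_2", 0.70),
    ("model_b", "category_1", 0.25),
    ("model_b", "category_2", 0.75),
]

averages = model_averages(rows)
print(averages)
```

An unweighted mean treats every category equally regardless of how many questions it contains, which is one reasonable choice when categories differ in size.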

## Data Format

The aggregated results combine evaluations from:
1. Nine main visual reasoning categories
2. Additional visual grounding tasks

Each model's performance is evaluated across multiple rotations (0°, 90°, 180°, 270°) to assess rotational robustness and consistency.

## Usage

To reproduce the evaluation tables:

```bash
# Generate full evaluation metrics (Tables 4-7)
python calculate_metrics.py

# Generate model-wise averages (Table 2)
python aggregate_model_metrics.py
```

## Note

Raw experimental outputs are not included due to their large size. The aggregated results provided here contain all the information needed to reproduce the analysis presented in the paper.

The TDBench dataset will be open-sourced following publication.