<div align="center">
  
# Small Drafts, Big Verdict: Information-Intensive Visual Reasoning via Speculation

[![arXiv](https://img.shields.io/badge/arXiv-2510.20812-b31b1b.svg?logo=arxiv&logoColor=white)](https://arxiv.org/abs/2510.20812)
[![Homepage](https://img.shields.io/badge/Repo-181717.svg?logo=github&logoColor=white)](https://github.com/Tinaliu0123/speculative-verdict)

Codebase for "[Small Drafts, Big Verdict: Information-Intensive Visual Reasoning via Speculation](https://arxiv.org/abs/2510.20812)"

</div>

## 📝 Table of Contents

- [Overview](#overview)
- [Project Structure](#project-structure)
- [Installation](#installation)
- [Download Datasets](#download-datasets)
- [Quick Start](#quick-start)
- [Usage](#usage)
  - [Modes](#modes)
  - [Parameters](#parameters)
- [Advanced Usage](#advanced-usage)
  - [Using Annotated Images](#using-annotated-images)
  - [Adding Custom Models](#adding-custom-models)
- [Evaluation](#evaluation)
- [Citation](#citation)

## Overview

![Method Overview](method.png)


## Project Structure

```
specverdict/
├── main.py                  # Main entry point for all pipeline stages
├── draft.py                 # Draft stage
├── verdict.py               # Verdict stage
├── consensus_scoring.py     # Consensus-based expert ranking
├── prompts.py               # Dataset-specific prompts
├── model.py                 # Model wrappers
├── utils/                   # Post-processing utilities
├── eval/                    # Evaluation framework
├── layout_annotation/       # Optional: OCR-based image annotation for information-intensive benchmarks
└── requirements.txt
```

## Installation

```bash
git clone https://github.com/Tinaliu0123/speculative-verdict.git
cd speculative-verdict
conda create -n specverdict python=3.10
conda activate specverdict

export OPENAI_API_KEY="your-key"
pip install -r requirements.txt
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
```

**Note:** Use `transformers>=4.57.1` for GLM-4.1V-Thinking. Use `transformers==4.53.0` for LLaVA-OneVision.

## Download Datasets

1. **InfographicVQA**: [InfographicVQA Dataset](https://www.docvqa.org/datasets/infographicvqa)
2. **ChartMuseum**: [🤗ChartMuseum Dataset](https://huggingface.co/datasets/lytang/ChartMuseum)
3. **ChartQAPro**: [🤗ChartQAPro Dataset](https://huggingface.co/datasets/ahmed-masry/ChartQAPro)
4. **HR-Bench**: [🤗HR-Bench Dataset](https://huggingface.co/datasets/DreamMr/HR-Bench)

**Note:** The following datasets require light preprocessing of the question text to match their task-specific format:
- ChartQAPro: Follow the official prompt in its [paper](https://arxiv.org/pdf/2504.05506) and **prepend** the required paragraph metadata when needed.
- HR-Bench 4K: For multiple-choice questions, **append** explicit A/B/C/D options immediately after the question.

## Quick Start

### Example: Complete Pipeline on InfographicVQA

```bash
# 1. Initial inference with 5 candidate models (QA mode)
python main.py \
    --mode inference \
    --inference_mode qa \
    --models path/to/model1 path/to/model2 path/to/model3 path/to/model4 path/to/model5 \
    --dataset infovqa \
    --in_json data/infovqa/test.jsonl \
    --out_json results/qa_inference.json

# 2. Compute consensus scores
python main.py \
    --mode prefill_cross \
    --models path/to/model1 path/to/model2 path/to/model3 path/to/model4 path/to/model5 \
    --in_json results/qa_inference.json \
    --out_json results/prefill.json

# 3. Select top-3 draft experts
python consensus_scoring.py \
    --input results/prefill.json \
    --output results/consensus.json \
    --top_k 3

# 4. Draft experts generate detailed reasoning (reason mode)
python main.py \
    --mode inference_from_topk \
    --inference_mode reason \
    --model_mapping '{"model1":"path/to/model1", "model2":"path/to/model2", ...}' \
    --consensus_file results/consensus.json \
    --in_json data/infovqa/test.jsonl \
    --out_json results/draft_reason.json

# 5. Final verdict synthesis
python main.py \
    --mode verdict \
    --verdict_backend gpt4o \
    --in_json results/draft_reason.json \
    --out_json results/verdict.json \
    --dataset infovqa \
    --annotated_folder data/infovqa/annotated/  # optional
# Notes:
# - set --verdict_backend to qwen or gpt4o
# - for qwen backend, add: --models path/to/verdict_model
# - for gpt4o, you can optionally set --verdict_openai_model (default: gpt-4o)
```

## Usage

### Modes

**Pipeline Modes:**
- `inference`: Generate answers or reasoning from multiple models
- `prefill_cross`: Compute consensus scores
- `inference_from_topk`: Run inference only for the selected draft experts
- `verdict`: Synthesize the drafts' reasoning into a final answer

**Inference Modes:**
- `qa`: Direct question answering
- `reason`: Step-by-step reasoning

### Parameters

| Parameter | Description | Example |
|-----------|-------------|---------|
| `--dataset` | Dataset type | `infovqa`, `museum`, `pro`, `hrbench` |
| `--inference_mode` | Inference type | `qa`, `reason` |
| `--models` | Model paths (space-separated) | `path/to/model1 path/to/model2 ...` |
| `--in_json` | Input file path | `data/test.jsonl` |
| `--out_json` | Output file path | `results/output.json` |
| `--start_idx` | Index of the first sample to process | `0` |
| `--max_entries` | Maximum number of samples to process | `100` |
| `--seed` | Random seed | `42` |
| `--merge_output` | Update and merge into existing file | flag |
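
To make the interplay of `--start_idx` and `--max_entries` concrete, here is a minimal sketch of one plausible reading: skip the first `start_idx` records of the JSONL input, then keep at most `max_entries`. The helper `read_slice` is hypothetical and only illustrates that reading; `main.py` may handle these flags differently.

```python
import json
from itertools import islice


def read_slice(lines, start_idx=0, max_entries=None):
    """Yield parsed JSONL records from start_idx on, at most max_entries of them."""
    stop = None if max_entries is None else start_idx + max_entries
    for line in islice(lines, start_idx, stop):
        yield json.loads(line)


# In-memory stand-in for a JSONL file with ten records.
demo = [json.dumps({"id": i}) for i in range(10)]
batch = list(read_slice(demo, start_idx=2, max_entries=3))
print([r["id"] for r in batch])  # [2, 3, 4]
```

Under this reading, a long run can be split into consecutive slices by increasing `--start_idx` in steps of `--max_entries`.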

**Verdict-specific:**
- `--annotated_folder`: Folder containing layout-annotated images for information-intensive tasks
- `--verdict_backend`: Verdict backend (`qwen` or `gpt4o`, default: `gpt4o`)
- `--verdict_openai_model`: OpenAI model name when using `gpt4o` backend
- `--verdict_api_key`: Optional OpenAI API key (otherwise uses `OPENAI_API_KEY`)

**Consensus-specific:**
- `--top_k`: Number of models to select (default: 3)
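
As a rough intuition for consensus-based selection, the sketch below ranks candidate models by the mean score their answers receive from the other models and keeps the best `top_k`. It is an assumption-laden illustration, not the logic of `consensus_scoring.py`: the function `select_top_k`, the nested-dict layout, and the "lower mean NLL is better" rule are all hypothetical.

```python
def select_top_k(nll, top_k=3):
    """Rank candidates by the mean NLL their answer gets from peer models.

    nll[candidate][scorer] is the score that `scorer` assigns to the
    answer produced by `candidate`; lower means stronger agreement.
    """
    scores = {}
    for candidate, by_scorer in nll.items():
        # Ignore a model's score of its own answer, if present.
        peers = [v for scorer, v in by_scorer.items() if scorer != candidate]
        scores[candidate] = sum(peers) / len(peers)
    ranked = sorted(scores, key=scores.get)
    return ranked[:top_k], scores


# Made-up numbers for a single question and four candidate models.
nll = {
    "model1": {"model2": 0.4, "model3": 0.6, "model4": 0.5},
    "model2": {"model1": 0.3, "model3": 0.2, "model4": 0.4},
    "model3": {"model1": 1.5, "model2": 1.2, "model4": 1.8},
    "model4": {"model1": 0.9, "model2": 0.7, "model3": 0.8},
}
chosen, scores = select_top_k(nll, top_k=3)
print(chosen)  # ['model2', 'model1', 'model4']
```

Here `model3` is dropped because its answer is the one its peers find least likely.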

## Advanced Usage

### Using Annotated Images

Generate and use layout-annotated images for information-intensive tasks:
```bash
# 1. Generate layout-annotated images
python layout_annotation/pipeline.py \
    --input data/infovqa/images/ \
    --output data/infovqa/annotated/

# 2. Use in verdict stage
python main.py \
    --mode verdict \
    --annotated_folder data/infovqa/annotated/ \
    ...
```

See [layout_annotation/README.md](layout_annotation/README.md) for details.

### Adding Custom Models

We currently support: Qwen2.5-VL, GLM-4V, MiMO, InternVL3/3.5, Ovis2.5, LLaVA-OneVision, Eagle2.5, Gemma3.

To add your own model, modify two files:

1. **`draft.py`**: Add the model loading logic in `load_vlm()`.
2. **`model.py`**: Create a wrapper class with:
   - `answer(img_path, question, prompt_tpl)` → generates a response
   - `prefill_nll(img_path, question, answer)` → applies masking and computes the perplexity score

See existing implementations for reference.
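
The shape of such a wrapper can be sketched as follows. This is a toy stand-in that loads no model: the class `ToyWrapper`, the helper `masked_nll`, the `{question}` placeholder in `prompt_tpl`, and the choice to return a mean negative log-likelihood (whose exponential is the perplexity) are assumptions made for illustration. The real wrappers in `model.py` may return a different quantity.

```python
import math
from typing import Sequence


def masked_nll(token_logprobs: Sequence[float], answer_mask: Sequence[bool]) -> float:
    """Mean negative log-likelihood over answer tokens only (prompt tokens masked out)."""
    kept = [-lp for lp, keep in zip(token_logprobs, answer_mask) if keep]
    if not kept:
        raise ValueError("answer_mask selects no tokens")
    return sum(kept) / len(kept)


class ToyWrapper:
    """Hypothetical wrapper exposing the two required methods; no real model is loaded."""

    def answer(self, img_path: str, question: str, prompt_tpl: str) -> str:
        # A real wrapper would run generation on the image plus the filled-in prompt.
        return f"stub answer to: {prompt_tpl.format(question=question)}"

    def prefill_nll(self, img_path: str, question: str, answer: str) -> float:
        # A real wrapper would read per-token log-probs from a forward pass;
        # fixed values are used here so the example runs offline.
        prompt_tokens = question.split()
        answer_tokens = answer.split()
        logprobs = [-2.0] * len(prompt_tokens) + [-0.5] * len(answer_tokens)
        mask = [False] * len(prompt_tokens) + [True] * len(answer_tokens)
        return masked_nll(logprobs, mask)


wrapper = ToyWrapper()
nll = wrapper.prefill_nll("img.png", "What is the total?", "42 units")
print(nll, math.exp(nll))  # mean answer NLL and the corresponding perplexity
```

The mask is what keeps prompt tokens from influencing the score, so only the likelihood of the candidate answer is measured.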

## Evaluation

Evaluate final results against ground truth, following each dataset's evaluation metrics:

```bash
# InfographicVQA (ANLS metric)
python eval/eval.py infovqa \
    --input results/verdict.json \
    --output eval_results.json

# ChartQAPro (Relaxed accuracy)
python eval/eval.py chartqapro \
    --input results/verdict.json \
    --meta data/chartqapro/metadata.jsonl

# ChartMuseum (GPT-based scoring)
python eval/eval.py chartmuseum \
    --input results/verdict.json

# HR-Bench (Accuracy)
python eval/eval.py hrbench \
    --input results/verdict.json \
    --bench data/hr_bench.jsonl
```

See [eval/README.md](eval/README.md) for detailed evaluation documentation.
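
For readers unfamiliar with ANLS, the sketch below implements the commonly used definition: per question, take the best normalized Levenshtein similarity against any ground-truth answer, zero it out when the normalized distance reaches the 0.5 threshold, then average. It is a standalone illustration; the function names are hypothetical and `eval/eval.py` may normalize answers differently.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def anls(predictions, references, threshold=0.5):
    """Average Normalized Levenshtein Similarity over questions."""
    total = 0.0
    for pred, golds in zip(predictions, references):
        p = pred.strip().lower()
        best = 0.0
        for gold in golds:
            g = gold.strip().lower()
            longest = max(len(p), len(g))
            nl = levenshtein(p, g) / longest if longest else 0.0
            # Similarities at or beyond the distance threshold count as zero.
            best = max(best, 1.0 - nl if nl < threshold else 0.0)
        total += best
    return total / len(predictions) if predictions else 0.0


preds = ["Paris", "pariz", "london"]
golds = [["paris"], ["paris"], ["paris"]]
print(round(anls(preds, golds), 3))  # 0.6
```

The three predictions score 1.0, 0.8, and 0.0 respectively, which averages to 0.6.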


## Citation

If you find this work useful, please cite our paper:
```bibtex
@misc{liu2025smalldraftsbigverdict,
      title={Small Drafts, Big Verdict: Information-Intensive Visual Reasoning via Speculation}, 
      author={Yuhan Liu and Lianhui Qin and Shengjie Wang},
      year={2025},
      eprint={2510.20812},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.20812}, 
}
```
