<div align="center">
    <h1>InferSpec: Adaptive Inference-Time Compute with Ensemble Verifier-Guided Speculative Decoding for Efficient Reasoning</h1>
</div>


## Introduction

<p float="left" align="middle">
  <img src="./imgs/overview.png" width="750">
</p>

We introduce Verifier-Guided Speculative Decoding, a novel framework for improving the inference efficiency of large language models (LLMs). We use log-probability-based and attention-based grounding verification to evaluate intermediate decoding steps from the draft model and dynamically decide whether to invoke the target model, optimizing the trade-off between computational cost and output quality. Extensive evaluations on challenging reasoning benchmarks show that our method delivers **significant efficiency and performance gains** over decoding with the target model alone.
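The accept-or-escalate decision described above can be sketched roughly as follows. This is a minimal illustration, not the repository's implementation: the names `DraftStep`, `logprob_verifier`, `grounding_verifier`, `majority_accept`, and `decode_step`, both thresholds, the use of a mean token log-probability, and the treatment of the attention-based grounding signal as a precomputed score in [0, 1] are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class DraftStep:
    """One intermediate reasoning step proposed by the draft model (hypothetical type)."""
    text: str
    token_logprobs: Sequence[float]  # per-token log-probabilities from the draft model
    grounding_score: float           # assumed attention-based grounding score in [0, 1]


def logprob_verifier(step: DraftStep, threshold: float = -0.5) -> bool:
    """Accept when the mean token log-probability reaches the (assumed) threshold."""
    if not step.token_logprobs:
        return False
    mean_logprob = sum(step.token_logprobs) / len(step.token_logprobs)
    return mean_logprob >= threshold


def grounding_verifier(step: DraftStep, threshold: float = 0.6) -> bool:
    """Accept when the grounding score reaches the (assumed) threshold."""
    return step.grounding_score >= threshold


def majority_accept(step: DraftStep,
                    verifiers: Sequence[Callable[[DraftStep], bool]]) -> bool:
    """Strict majority vote over the ensemble of verifiers."""
    votes = sum(1 for verify in verifiers if verify(step))
    return 2 * votes > len(verifiers)


def decode_step(step: DraftStep,
                verifiers: Sequence[Callable[[DraftStep], bool]],
                call_target: Callable[[str], str]) -> str:
    """Keep the draft step if the ensemble accepts it; otherwise ask the target model."""
    if majority_accept(step, verifiers):
        return step.text
    return call_target(step.text)
```

Here `call_target` stands in for whatever function regenerates the step with the target model. With only two verifiers, the strict majority rule accepts a draft step only when both agree, so the target model is invoked whenever either verifier is unsure.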

## Support
- [x] **vLLM online mode**: Requires at least 2 GPUs to serve the draft and target models, since vLLM doesn't support serving multiple models on 1 GPU.

## Installation
```shell
# For math evaluation
pip install -r requirements.txt
```

## Efficient Decoding
**1. Preparation**

We mainly use the [Qwen2.5-Math family](https://huggingface.co/collections/Qwen/qwen25-math-66eaa240a1b7d5ee65f1da3e). Before serving, change ``max_position_embeddings`` in each model's ``config.json`` from 4096 to 16384 to avoid a ``max_tokens`` error in vLLM. We only use generations shorter than 4096 tokens, so this change does not affect performance.
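The config change can be made by hand or scripted. The helper below is a small sketch, not part of the repository; the function name `set_max_position_embeddings` is hypothetical, and you need to pass the path to the ``config.json`` of your locally downloaded model.

```python
import json
from pathlib import Path
from typing import Optional


def set_max_position_embeddings(config_path: str, new_value: int = 16384) -> Optional[int]:
    """Rewrite max_position_embeddings in a config.json and return the previous value."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    old_value = config.get("max_position_embeddings")  # None if the key was absent
    config["max_position_embeddings"] = new_value
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return old_value
```

Call it once per model directory (draft and target). The returned value lets you confirm that the original setting was 4096 before it was overwritten.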

**2. Model serve**
```shell
bash scripts/serve_draft_model.sh
bash scripts/serve_target_model.sh
```
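Before starting the evaluation, it can help to confirm that both servers are accepting requests. The sketch below assumes the serve scripts launch vLLM's OpenAI-compatible HTTP server, which exposes a `/health` route; the function name `server_is_ready` is hypothetical, and the base URLs depend on the host and port configured in your scripts.

```python
import urllib.error
import urllib.request


def server_is_ready(base_url: str, timeout: float = 5.0) -> bool:
    """Return True if GET {base_url}/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: treat as not ready.
        return False
```

For example, if the draft and target servers were configured to listen on ports 8000 and 8001 (an assumption, check the serve scripts), you would call `server_is_ready("http://localhost:8000")` and `server_is_ready("http://localhost:8001")` and wait until both return `True`.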

**3. Evaluation**
```shell
bash scripts/math_eval_ensemble_majority.sh
```

## Acknowledgement
Our codebase builds mainly on [RSD](https://github.com/BaohaoLiao/RSD).

