# SPEED-Bench Measurement Framework

This repository contains the supplementary material for the SPEED-Bench paper, specifically the lightweight measurement framework for benchmarking speculative decoding (SD) across multiple inference engines, using the SPEED-Bench dataset.

The framework operates as a thin client that performs tokenization and prompt formatting externally and transmits pre-tokenized inputs so all engines process identical token sequences. It is built around an asynchronous `asyncio` event loop for concurrent request dispatch, extracts fine-grained timing and acceptance signals from streaming responses to compute metrics such as TTFT, step latency, and throughput, and provides native integrations with SGLang, vLLM, and TensorRT-LLM to evaluate SD under realistic deployment optimizations.
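
As a rough illustration of how such metrics can be derived from streaming responses, the sketch below computes TTFT, mean step latency, and output throughput from per-chunk arrival timestamps. The `StreamTrace` structure and the function names are hypothetical and not part of the framework's API, and the framework's own metric definitions may differ in detail.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StreamTrace:
    """Timing record for one streamed request (hypothetical structure)."""
    send_time: float           # when the request was dispatched, in seconds
    chunk_times: List[float]   # arrival time of each streamed chunk
    chunk_tokens: List[int]    # number of tokens carried by each chunk


def ttft(trace: StreamTrace) -> float:
    """Time to first token: first chunk arrival minus dispatch time."""
    return trace.chunk_times[0] - trace.send_time


def mean_step_latency(trace: StreamTrace) -> float:
    """Average gap between consecutive chunks (assumes one chunk per decoding step)."""
    gaps = [b - a for a, b in zip(trace.chunk_times, trace.chunk_times[1:])]
    return sum(gaps) / len(gaps) if gaps else 0.0


def output_throughput(trace: StreamTrace) -> float:
    """Generated tokens per second, measured from dispatch to the last chunk."""
    elapsed = trace.chunk_times[-1] - trace.send_time
    return sum(trace.chunk_tokens) / elapsed


# Toy trace: three chunks carrying 1, 3, and 4 tokens.
trace = StreamTrace(send_time=0.0, chunk_times=[0.5, 0.75, 1.0], chunk_tokens=[1, 3, 4])
print(ttft(trace), mean_step_latency(trace), output_throughput(trace))
```

With speculative decoding, a single step can emit several tokens, which is why the sketch tracks token counts per chunk rather than assuming one token per chunk.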

The framework also supports integration with the SpecBench framework, which makes it possible to evaluate methods implemented in custom HuggingFace/PyTorch code for research purposes. See [SpecBench Porting](SPECBENCH_PORTING.md) for more details.

Furthermore, the framework supports other datasets for evaluation, such as SpecBench and MT-Bench.

The main entrypoint of the framework is the `run.py` script; key parameters are documented below.

## Getting Started

First, start a Docker container for the desired inference engine. For the experiments in the paper, we used the following Docker images:

1. **TensorRT-LLM**: [nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc1](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release?version=1.2.0rc1), [nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc7](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release?version=1.2.0rc7)<sup>*</sup>.
2. **vLLM**: [vllm/vllm-openai:v0.13.0](https://hub.docker.com/layers/vllm/vllm-openai/v0.13.0/images/sha256-1ae45c63e0578cf8ea690299c20b6a32a5c3ddf659e92ea15c0ae6b8040ba741).
3. **SGLang**: [lmsysorg/sglang:v0.5.7](https://hub.docker.com/layers/lmsysorg/sglang/v0.5.7/images/sha256-34d728fd77f57ae62f5bf236239ed48774f1e96f8a293adf2e1e29bfe5949bbb).

<sup>*</sup> TensorRT-LLM 1.2.0rc7 was used only with T=1 for running GPT-OSS 120B and Llama 3.3 70B target models, using N-Grams and Eagle3 for drafting.

### Running SPEED-Bench

Install the dependencies with `pip install -r requirements.txt`.

#### Target model Llama 3.3 70B Instruct with EAGLE3 drafter on the Qualitative Split (Table 1)

*Using TensorRT-LLM version 1.2.0rc1*

Create the file `runtime_args_long_context.yaml` to support longer contexts (>8192 tokens) in TensorRT-LLM:
```yaml
engine_args:
  max_seq_len: 131072   # Model max context length (for Llama 3.3 70B)
  enable_chunked_prefill: true
```

Download the draft model from HuggingFace: [yuhuili/EAGLE3-LLaMA3.3-Instruct-70B](https://huggingface.co/yuhuili/EAGLE3-LLaMA3.3-Instruct-70B) to a local directory `/path/to/yuhuili/EAGLE-LLaMA3-Instruct-70B`.

Then run the benchmark:
```bash
python3 run.py --model_dir meta-llama/Llama-3.3-70B-Instruct --tokenizer meta-llama/Llama-3.3-70B-Instruct --draft_model_dir /path/to/yuhuili/EAGLE-LLaMA3-Instruct-70B --dataset speed --dataset_path qualitative --tp_size 1 --ep_size 1 --draft_length 3 --output_length 4096 --engine TRTLLM --concurrency 32 --runtime_params runtime_args_long_context.yaml --show_progress --save_dir results/speed_llama_3.3_70b_eagle3_tensorrtllm_dl3_bs32
```

#### Target model GPT OSS 120B with EAGLE3 drafter on the Throughput Split (Figure 7)

Download the draft model from HuggingFace: [nvidia/gpt-oss-120b-Eagle3-long-context](https://huggingface.co/nvidia/gpt-oss-120b-Eagle3-long-context) to a local directory `/path/to/nvidia/gpt-oss-120b-Eagle3-long-context`

*Using vLLM version 0.13.0*

```bash
for batch_size in 2 4 8 16 32 64 128 256 512
do 
  python3 run.py --model_dir openai/gpt-oss-120b --tokenizer openai/gpt-oss-120b --postprocess gptoss --draft_model_dir /path/to/nvidia/gpt-oss-120b-Eagle3-long-context --dataset speed --dataset_path throughput_2k --tp_size 1 --ep_size 1 --draft_length 3 --output_length 4096 --ignore_eos --engine VLLM --concurrency ${batch_size} --show_progress --save_dir results/speed_throughput_2k_gptoss120b_eagle3_vllm_dl3_bs${batch_size}
done
```

To run the baseline without speculative decoding:

```bash
for batch_size in 2 4 8 16 32 64 128 256 512
do 
  python3 run.py --model_dir openai/gpt-oss-120b --tokenizer openai/gpt-oss-120b --postprocess gptoss --speculative_algorithm NONE --dataset speed --dataset_path throughput_2k --tp_size 1 --ep_size 1 --draft_length 1 --output_length 4096 --ignore_eos --engine VLLM --concurrency ${batch_size} --show_progress --save_dir results/speed_throughput_2k_gptoss120b_vllm_dl3_bs${batch_size}
done
```

### Running Random Tokens on GPT OSS + Eagle3

Download [nvidia/gpt-oss-120b-Eagle3-long-context](https://huggingface.co/nvidia/gpt-oss-120b-Eagle3-long-context) to a local directory `/path/to/eagle`.

```bash
python3 run.py --model_dir openai/gpt-oss-120b --tokenizer openai/gpt-oss-120b --postprocess gptoss --draft_model_dir /path/to/eagle --random_isl 1024 --tp_size 1 --ep_size 1 --draft_length 3 --output_length 4096 --num_requests 40 --engine TRTLLM --concurrency 1 --ignore_eos
```

### Running MT-Bench on GPT OSS + Eagle3

A basic example run script that benchmarks MT-Bench is provided.
MT-Bench is available [here](https://huggingface.co/datasets/HuggingFaceH4/mt_bench_prompts).

Download [nvidia/gpt-oss-120b-Eagle3-long-context](https://huggingface.co/nvidia/gpt-oss-120b-Eagle3-long-context) to a local directory `/path/to/eagle`.

```bash
python3 run.py --model_dir openai/gpt-oss-120b --tokenizer openai/gpt-oss-120b --draft_model_dir /path/to/eagle --mtbench question.jsonl --tp_size 1 --ep_size 1 --draft_length 3 --output_length 4096 --num_requests 80 --engine TRTLLM --concurrency 1 --postprocess gptoss
```


## Key Parameters

The main entrypoint `run.py` exposes a set of CLI flags for selecting the models, dataset, inference engine, speculation algorithm, and measurement settings.
Key parameters are documented below; use `--help` for more advanced parameters.

### Model and engine selection
- **`--engine`**: Inference backend to use. Choices: `TRTLLM`, `VLLM`, `SGLANG`, `SPECBENCH_MEDUSA`.
- **`--model_dir` (required)**: Target model identifier/path (e.g., a HuggingFace model name or local path), passed to the selected engine wrapper.
- **`--tokenizer` (required)**: Tokenizer identifier/path used by the client for prompt formatting + tokenization.
- **`--tp_size` / `--ep_size`**: Parallelism configuration forwarded to the engine wrapper (tensor parallel size / MoE expert parallel size).
- **`--postprocess`**: Text postprocessing applied to decoded outputs. Choices: `base`, `gptoss`.

### Speculative decoding configuration
- **`--speculative_algorithm`**: Speculation method. Choices: `EAGLE3`, `EAGLE`, `DRAFT_TARGET`, `NGRAM`, `MTP`, `NONE`.
- **`--draft_model_dir`**: Draft model identifier/path (required by most speculative methods such as `EAGLE3` / `EAGLE`; not needed for `NONE`, `MTP` and `NGRAM` speculative algorithms).
- **`--draft_length`**: Number of speculative steps/tokens per iteration.
- **`--output_length`**: Maximum number of output tokens to generate per request/turn.
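
To make the role of `--draft_length` concrete, the sketch below computes a mean acceptance length and a draft acceptance rate from per-step token counts. It assumes the common convention that each verification step emits the accepted draft tokens plus one token from the target model, so a step yields between 1 and `draft_length + 1` tokens. The function name and return keys are hypothetical, and the framework may define these quantities differently.

```python
from typing import Dict, List


def acceptance_stats(tokens_per_step: List[int], draft_length: int) -> Dict[str, float]:
    """Summarize acceptance behavior from the tokens emitted at each decoding step.

    Assumption: every step emits its accepted draft tokens plus exactly one
    token produced by the target model.
    """
    for n in tokens_per_step:
        if not 1 <= n <= draft_length + 1:
            raise ValueError(f"step emitted {n} tokens, expected 1..{draft_length + 1}")
    steps = len(tokens_per_step)
    total = sum(tokens_per_step)
    accepted_drafts = total - steps  # remove the one target-model token per step
    return {
        "mean_acceptance_length": total / steps,
        "draft_acceptance_rate": accepted_drafts / (steps * draft_length),
    }


# Four steps with a draft length of 3: 10 tokens in total, 6 of them accepted drafts.
stats = acceptance_stats([4, 1, 3, 2], draft_length=3)
print(stats)
```

Under this convention, a mean acceptance length of 1.0 corresponds to no draft token ever being accepted, which matches plain autoregressive decoding.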

### Dataset selection
- **`--dataset`**: One of `speed`, `specbench`, `mtbench`, `random`.
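- **`--dataset_path`**: Local path to the dataset files, or a split name for the `speed` dataset (the examples above use `qualitative` and `throughput_2k`).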
- **`--num_requests`**: Optional cap on the number of requests/samples from the chosen dataset.

### Concurrency and progress reporting
- **`--concurrency`**: Maximum number of concurrent requests (batch size).
- **`--show_progress`**: Enable progress bar.
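
One common way to cap the number of in-flight requests in an `asyncio` client is a semaphore. The sketch below illustrates that pattern only; it is not the framework's actual dispatch code, and `fake_request` is a hypothetical stand-in for a streaming engine call.

```python
import asyncio
from typing import List, Tuple


async def fake_request(i: int, delay: float = 0.01) -> int:
    """Hypothetical stand-in for a streaming request to an inference engine."""
    await asyncio.sleep(delay)
    return i


async def run_all(num_requests: int, concurrency: int) -> Tuple[List[int], int]:
    """Dispatch all requests while keeping at most `concurrency` in flight."""
    sem = asyncio.Semaphore(concurrency)
    in_flight = 0
    peak = 0

    async def bounded(i: int) -> int:
        nonlocal in_flight, peak
        async with sem:  # blocks once `concurrency` requests are active
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await fake_request(i)
            finally:
                in_flight -= 1

    # gather returns results in submission order, regardless of completion order
    results = await asyncio.gather(*(bounded(i) for i in range(num_requests)))
    return list(results), peak


results, peak = asyncio.run(run_all(num_requests=8, concurrency=2))
print(results, peak)
```

The `peak` counter is only there to show that the number of simultaneously active requests never exceeds the configured limit.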

### Metrics and logging outputs
- **`--save_dir`**: If set, write metrics to this directory.

### Advanced: `--runtime_params` YAML
Use `--runtime_params /path/to/config.yaml` to pass structured configuration. `run.py` looks for the following top-level keys:
- **`chat_template_args`**: Forwarded to `encode_chat(...)` to control chat templating (e.g., reasoning effort).
- **`dataset_kwargs`**: Passed to the dataset constructor (e.g., `input_len` for the `random` dataset).
- **`engine_args`**: Forwarded to the engine wrapper constructor (e.g., max sequence length, chunked prefill).
- **`sampling_kwargs`**: Sampling configuration passed to the engine wrapper (e.g., temperature).
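
The sketch below shows one way a parsed configuration could be split into these four sections, with missing sections defaulting to empty dictionaries. It is an illustration under stated assumptions rather than the logic in `run.py`: `split_runtime_params` is a hypothetical helper, the input is assumed to be the already-parsed YAML mapping, and the `temperature` key name is an assumed example.

```python
from typing import Any, Dict

# Top-level keys listed above.
KNOWN_SECTIONS = ("chat_template_args", "dataset_kwargs", "engine_args", "sampling_kwargs")


def split_runtime_params(params: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Return one dict per known section; absent or null sections become {}."""
    return {name: dict(params.get(name) or {}) for name in KNOWN_SECTIONS}


# Parsed form of a config file (in practice, the result of loading the YAML).
parsed = {
    "engine_args": {"max_seq_len": 131072, "enable_chunked_prefill": True},
    "sampling_kwargs": {"temperature": 1.0},  # key name is an assumption
}

sections = split_runtime_params(parsed)
print(sections["engine_args"])
print(sections["dataset_kwargs"])
```

Defaulting to empty dictionaries lets each section be unpacked as keyword arguments without checking whether it was present in the file.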

## Data Release Note
The SPEED-Bench dataset is not included directly in this supplementary material due to size limitations and to remain compliant with the licenses and agreements of all original data sources used. 

The dataset is constructed using a hybrid mechanism:
* **Direct Sampling:** For sources with permissive licenses, SPEED-Bench will include the sampled data directly.
* **Scripted Construction:** For sources with more restrictive licenses, we provide automated gathering and construction scripts (also available within this framework). These scripts allow for local reconstruction of the data while respecting the original terms of service.

Upon paper acceptance, we will release the consolidated dataset under the relevant license.

## Code License
The code provided in this supplementary material is derived from an open-source repository licensed under the **Apache License 2.0**. This code remains under the Apache 2.0 license. In accordance with the double-blind review policy, the source repository will be disclosed upon acceptance of the publication.
