<div align="center"><h1>&nbsp;PEARL: Parallel Speculative Decoding with Adaptive Draft Length</h1></div>

<br>

> TL;DR: We introduce **PEARL** (Parallel spEculative decoding with Adaptive dRaft Length) to further reduce the inference latency of Large Language Models (LLMs). PEARL is a **parallel** inference framework built on speculative decoding that uses *pre-verify* and *post-verify* strategies to adapt the draft length.




## Preparation

Follow the instructions below to prepare for reproducing the results in the paper.

1. Experimental environment: run `sh install.sh` to install the packages the project needs.
2. Code changes: edit lines 37-46 of `src/util.py` to fill in your model paths.
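
To see which lines need your model paths before editing, you can print that range first. This is only a convenience sketch: it assumes you run it from the repository root, and `UTIL_FILE` is an illustrative variable name, not something the project defines.

```shell
# Print lines 37-46 of src/util.py so you can see which paths to replace.
UTIL_FILE="src/util.py"   # illustrative variable; the path comes from step 2 above
if [ -f "$UTIL_FILE" ]; then
    sed -n '37,46p' "$UTIL_FILE"
else
    echo "$UTIL_FILE not found; run this from the repository root" >&2
fi
```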




## Examples

You can try this code with a simple command:

```shell
CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch --num_processes 2 benchmark/eval_humaneval.py --eval_mode para_sd --gamma 5 -n 1  -e test --draft_model codellama-7b --target_model codellama-70b --max_tokens 1024 --temp 0
```
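
If you want to swap models or change the draft setting without retyping the whole line, the same command can be assembled from variables. The sketch below is a hypothetical wrapper, not part of the repository: the variable names are illustrative, every flag and value is copied from the command above, and it only prints the command (a dry run) so you can check it before launching.

```shell
# Hypothetical wrapper: variable names are illustrative, flags are copied
# unchanged from the example command.
GPUS="0,1,2,3"
GAMMA=5
DRAFT_MODEL="codellama-7b"
TARGET_MODEL="codellama-70b"
MAX_TOKENS=1024

CMD="accelerate launch --num_processes 2 benchmark/eval_humaneval.py"
CMD="$CMD --eval_mode para_sd --gamma $GAMMA -n 1 -e test"
CMD="$CMD --draft_model $DRAFT_MODEL --target_model $TARGET_MODEL"
CMD="$CMD --max_tokens $MAX_TOKENS --temp 0"

# Dry run: print the full command instead of executing it.
FULL_CMD="CUDA_VISIBLE_DEVICES=$GPUS $CMD"
echo "$FULL_CMD"
```

Once the printed command looks right, paste it into your shell to run it. The model names must match the entries you filled in for `src/util.py`.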