# A Drop-In Solution for On-the-Fly Adaptation of Speculative Decoding in Large Language Models

This paper introduces *on-the-fly adaptation of speculative decoding*, a solution that dynamically adapts speculative decoding choices at inference time to maximize the efficiency of LLM inference. As a drop-in solution, it requires no offline benchmarking or training.

The code used for the empirical experiments is available in this repository for validation and replication.

## Pre-requisites
This project was developed primarily on CUDA GPUs. We recommend CUDA compute capability 8.0 (SM 8.x, Ampere) or higher.
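
To check whether a local GPU meets this recommendation, a minimal sketch is shown below. It assumes PyTorch is installed and uses `torch.cuda.get_device_capability()`, which returns a `(major, minor)` tuple. The helper name `meets_recommendation` is illustrative and not part of this repository.

```python
def meets_recommendation(capability, minimum=(8, 0)):
    """Return True if a (major, minor) compute capability is at least `minimum`."""
    return tuple(capability) >= tuple(minimum)

try:
    import torch  # assumption: PyTorch is installed in the environment
except ImportError:
    torch = None

if torch is not None and torch.cuda.is_available():
    cap = torch.cuda.get_device_capability(0)  # capability of GPU 0
    status = "OK" if meets_recommendation(cap) else "below the recommended SM 8.0"
    print(f"GPU 0 compute capability {cap[0]}.{cap[1]}: {status}")
else:
    print("No CUDA device detected (or PyTorch is not installed).")
```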

The Slurm batch scripts under `scripts/` are specific to the authors' HPC cluster and will likely need adjusting for other systems.

## Setup

```sh
conda create -n llm python=3.9
```

Then activate the environment and install the dependencies:
```sh
conda activate llm
pip3 install -r requirements.txt
```

## Datasets 
These are the datasets and domains used. For now, we do not use a general-purpose chat dataset. There is no need to download the datasets manually; the HuggingFace library downloads them automatically.


## Models
Here are the models experimented with so far. Using the fine-tuned chat models can give better-aligned results, depending on the domain and prompt, and some prompt engineering is required to steer a model toward good output. However, speculative decoding with two models that share an architecture but were trained on different data can be problematic. For example, pairing TinyLLaMA with the official LLaMA model can lead to many rejected tokens, because the two models were trained by different teams, possibly on different data. The chat fine-tuned variants are also prompted somewhat differently. We therefore pair models with some caution.
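
The effect of a mismatched draft model on rejections can be illustrated with a toy calculation. Under the standard speculative sampling rule, a token `x` drafted from the draft distribution `q` is accepted with probability `min(1, p(x)/q(x))`, where `p` is the target distribution, so the expected per-token acceptance rate is the sum over `x` of `min(p(x), q(x))`. The sketch below computes that quantity. The distributions are made-up numbers for illustration, not measurements from these models, and the function name is hypothetical.

```python
def expected_acceptance(target_probs, draft_probs):
    """Expected probability that one drafted token is accepted.

    target_probs (p) and draft_probs (q) are distributions over the same
    vocabulary; the result is sum_x min(p(x), q(x)).
    """
    if len(target_probs) != len(draft_probs):
        raise ValueError("distributions must share one vocabulary")
    return sum(min(p, q) for p, q in zip(target_probs, draft_probs))

# Made-up next-token distributions over a 4-token vocabulary.
target = [0.70, 0.20, 0.05, 0.05]
aligned_draft = [0.60, 0.25, 0.10, 0.05]     # draft close to the target
mismatched_draft = [0.10, 0.10, 0.40, 0.40]  # draft far from the target

print(f"aligned:    {expected_acceptance(target, aligned_draft):.2f}")
print(f"mismatched: {expected_acceptance(target, mismatched_draft):.2f}")
```

In this toy case the aligned draft has an expected acceptance of 0.90, against 0.30 for the mismatched one.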




### How much memory do I need?

Memory requirements grow with parameter count and can exceed the available GPU memory for larger models. Below is the estimated memory usage for the LLaMA series with 8-bit quantization (Q8) enabled. Use it to gauge the requirements of other models.

| Model | Memory (GB) (Q8) |
|:-------:|:------:|
| LLaMA 7B   | 6.5 |
| LLaMA 13B | 12.5 |
| LLaMA 70B | 90 |
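
For models not listed, the memory taken by the weights alone can be approximated as parameter count times bytes per parameter. The sketch below is a back-of-the-envelope lower bound under that assumption; it ignores the KV cache, activations, and framework overhead, so its output may differ from the figures in the table. The function name is illustrative.

```python
def weight_memory_gib(num_params, bits_per_param=8):
    """Approximate memory in GiB needed to hold the model weights only."""
    return num_params * bits_per_param / 8 / 2**30

# Nominal parameter counts; real checkpoints differ slightly.
for name, params in [("7B", 7e9), ("13B", 13e9), ("70B", 70e9)]:
    print(f"{name}: ~{weight_memory_gib(params):.1f} GiB at 8-bit, "
          f"~{weight_memory_gib(params, 16):.1f} GiB at 16-bit")
```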



Note that OpenLLaMA tokenizers are not cross-compatible with Meta's or TinyLLaMA's tokenizers, despite the model architecture being the same. Do not mix models with incompatible tokenizers for speculative decoding.
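
One way to check a candidate pair before running is to compare the two tokenizers' vocabularies. The sketch below assumes vocabularies in the form returned by a HuggingFace tokenizer's `get_vocab()` method (a token-to-id dictionary). The helper names and the toy vocabularies are illustrative only, and identical vocabularies are a necessary condition, not a guarantee that a pair works well.

```python
def vocabs_compatible(target_vocab, draft_vocab):
    """Return True when both token -> id mappings are identical."""
    return target_vocab == draft_vocab

def first_mismatches(target_vocab, draft_vocab, limit=5):
    """List up to `limit` tokens whose ids differ or that exist in only one vocab."""
    tokens = sorted(set(target_vocab) | set(draft_vocab))
    diffs = [(t, target_vocab.get(t), draft_vocab.get(t))
             for t in tokens if target_vocab.get(t) != draft_vocab.get(t)]
    return diffs[:limit]

# Toy vocabularies standing in for tokenizer.get_vocab() output.
vocab_a = {"<s>": 0, "hello": 1, "world": 2}
vocab_b = {"<s>": 0, "hello": 2, "world": 1}

print(vocabs_compatible(vocab_a, vocab_a))
print(first_mismatches(vocab_a, vocab_b))
```

With real models, the dictionaries would come from `AutoTokenizer.from_pretrained(name).get_vocab()` for the target and the draft model respectively.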

Example commands for some common model pairs:
```sh
python3 main.py --target-model bigscience/bloom-7b1 --draft-model bigscience/bloom-560m -b --dataset gsm8k --mode sps 
python3 main.py --target-model google/gemma-7b --draft-model google/gemma-2b -b --dataset gsm8k --mode sps 
python3 main.py --target-model facebook/opt-13b --draft-model facebook/opt-125m -b --dataset gsm8k --mode sps 
```

You will need to read each model's paper or technical report to understand how it should be prompted. Some are fine-tuned models, and they all seem to be prompted differently. For decoder-only models at least, the repository should apply all the speculative decoding techniques out of the box, as long as the model is compatible with the HuggingFace library.

## Benchmarks

To debug implementations, you can just run:
```sh
python3 main.py
```

## Commands

Use the following commands for exploratory data analysis of the datasets and for experimentation with compression.

```sh
python3 main.py --eda
```

```sh
python main.py \
    --input "<|system|>\nYou are given a math question, and your task is to answer it. Then provide a step-by-step walkthrough on how you got the answer to the question.\n<|user|>\nJames writes a 3-page letter to 2 different friends twice a week. How many pages does he write a year?\n<|assistant|>\n" \
    --target_model_name openlm-research/open_llama_13b \
    --approx_model_name openlm-research/open_llama_3b
```
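
The `--input` string above follows a `<|system|>` / `<|user|>` / `<|assistant|>` layout. A small helper for assembling such prompts might look like the sketch below. The function name is hypothetical and other models expect other templates. This version emits real newline characters; whether `main.py` expects real newlines or literal `\n` sequences is not specified here.

```python
def build_prompt(system, user):
    """Assemble a prompt in the <|system|>/<|user|>/<|assistant|> layout."""
    return f"<|system|>\n{system}\n<|user|>\n{user}\n<|assistant|>\n"

prompt = build_prompt(
    "You are given a math question, and your task is to answer it.",
    "James writes a 3-page letter to 2 different friends twice a week. "
    "How many pages does he write a year?",
)
print(prompt)
```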


The implementation of *on-the-fly adaptation of speculative decoding* is located in the source file.

## Copyright
The copyright of this repository belongs to the authors of the ICLR'2025 paper submission (#3120). The purpose of this package is only for the assessment by the ICLR'2025 program committee during the paper review process; any other uses for any other purposes are prohibited.