# PPD (Predictive Pipelined Decoding)

This repository contains the implementation of [Predictive Pipelined Decoding: A Compute-Latency Trade-off for Exact LLM Decoding](https://openreview.net/forum?id=yUmJ483OB0).

## Requirements

Running PPD requires K+1 GPUs in total, where K is the number of sub-processes (`--num_of_process`).
We ran our experiments on four A100-SXM4-40GB GPUs.
The code was developed on a DGX server.
The NVIDIA driver version is 515.86.01 (CUDA version 11.8).

`Environment Information`
```
OS: Ubuntu 20.04.6 LTS
Allocated CPU core(s): 120
NVLink BW: 25GB/s (12 Links in each GPU)
```
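
As a quick sanity check of the K+1 GPU requirement before launching a run, a minimal sketch is given below. It assumes that K is the value passed to `--num_of_process` and that PyTorch is installed in the environment. The helper names (`gpus_needed`, `has_enough_gpus`, `visible_gpu_count`) are hypothetical and not part of this repository.

```python
def gpus_needed(num_of_process: int) -> int:
    """Total GPUs for a PPD run, assuming K = num_of_process and K+1 GPUs overall."""
    if num_of_process < 1:
        raise ValueError("num_of_process must be at least 1")
    return num_of_process + 1


def has_enough_gpus(num_of_process: int, available: int) -> bool:
    """Return True if `available` GPUs cover the K+1 requirement."""
    return available >= gpus_needed(num_of_process)


def visible_gpu_count() -> int:
    """Count visible CUDA devices via PyTorch; returns 0 if torch is not installed."""
    try:
        import torch
    except ImportError:
        return 0
    return torch.cuda.device_count()
```

For example, `has_enough_gpus(3, visible_gpu_count())` should hold on a four-GPU machine such as the one described above, since `gpus_needed(3)` is 4.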

## Getting started

```bash
conda env create -f environment.yaml
```

## Usage
### Greedy decoding
```bash
python greedy_decoding.py
```

### Greedy decoding with PPD
```bash
# Run with 1 sub-process
python PPD.py --num_of_process 1
# Run with 3 sub-processes
python PPD.py --num_of_process 3
```
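
To compare runs with different numbers of sub-processes, the commands above can be wrapped in a small timing script. The following is only a sketch: `build_ppd_command` and `time_command` are hypothetical helpers, and the measured time is end-to-end wall-clock time of the whole process (including start-up), not per-token latency.

```python
import subprocess
import time
from typing import List


def build_ppd_command(num_of_process: int) -> List[str]:
    """Build the PPD command line shown in this README for a given sub-process count."""
    return ["python", "PPD.py", "--num_of_process", str(num_of_process)]


def time_command(cmd: List[str]) -> float:
    """Run a command to completion and return its wall-clock duration in seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)  # raises CalledProcessError on a non-zero exit code
    return time.perf_counter() - start
```

Calling `time_command(build_ppd_command(1))` and `time_command(build_ppd_command(3))` from the repository root, and `time_command(["python", "greedy_decoding.py"])` for the baseline, gives a rough comparison of the three configurations.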

## Citation
```bibtex
@misc{yang2023predictive,
      title={Predictive Pipelined Decoding: A Compute-Latency Trade-off for Exact LLM Decoding}, 
      author={Seongjun Yang and Gibbeum Lee and Jaewoong Cho and Dimitris Papailiopoulos and Kangwook Lee},
      year={2023},
      eprint={2307.05908},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```