# Skeleton-of-Thought: Large Language Models Can Do Parallel Decoding

Fully sequential decoding is a major cause of the long generation latency of LLMs. Taking inspiration from the organized way humans think and write, we question the common assumption that LLMs have to decode fully sequentially. The Skeleton-of-Thought (SoT) method first guides the LLM to generate the skeleton of the answer, and then uses parallel API calls or batched decoding to complete the content of each skeleton point in parallel. SoT can provide considerable end-to-end speed-ups and can also potentially improve the answer quality on some question categories. To make the overall solution more practical, an extension, SoT with router (SoT-R), employs a GPT-4-prompting router or a trained RoBERTa router to trigger SoT only for suitable questions.
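
The two-stage idea can be illustrated with a minimal Python sketch. This is not the implementation under `sot/`: `fake_llm`, `parse_skeleton`, `skeleton_of_thought`, and the `SKELETON:` / `POINT:` prompt formats are hypothetical names made up for illustration, and the stub returns canned text so the example runs without a model or API key.

```python
import re
from concurrent.futures import ThreadPoolExecutor


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call (API request or local model)."""
    if prompt.startswith("SKELETON:"):
        # Stage 1: return a short numbered skeleton.
        return "1. Define the problem\n2. List the options\n3. Give a recommendation"
    # Stage 2: expand one point; echo it so the sketch stays deterministic.
    point = prompt.split("POINT:", 1)[1].strip()
    return f"Details about '{point}'."


def parse_skeleton(skeleton: str) -> list[str]:
    """Extract the text of each numbered point, e.g. '1. foo' -> 'foo'."""
    pattern = re.compile(r"^\s*\d+\.\s*(.+)$", re.MULTILINE)
    return [match.group(1).strip() for match in pattern.finditer(skeleton)]


def skeleton_of_thought(question: str, llm=fake_llm, max_workers: int = 4) -> str:
    # Stage 1: a single sequential call produces the skeleton.
    points = parse_skeleton(llm(f"SKELETON: {question}"))
    # Stage 2: expand every point concurrently (assumed here: one call per point).
    prompts = [f"QUESTION: {question}\nPOINT: {point}" for point in points]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        expansions = list(pool.map(llm, prompts))  # map preserves point order
    return "\n".join(
        f"{index}. {point}: {expansion}"
        for index, (point, expansion) in enumerate(zip(points, expansions), start=1)
    )


if __name__ == "__main__":
    print(skeleton_of_thought("How do I choose a laptop?"))
```

The thread pool mirrors the parallel-API-call setting; with a locally hosted model, the same point prompts would instead be decoded as one batch.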

The repo is organized as follows.
* The SoT implementation is under [`sot/`](sot/).
* The SoT prompts are given under [`prompts/`](prompts/). For example, `sot_opensource.json` is used for all open-source models, and `sot_gpt4` is used for the GPT-4 API.
* The processed data are under [`data/`](data/).
* The scripts under [`scripts/`](scripts/) are used to dump, evaluate, and plot the results.
* The Gradio demo code is under [`demo/`](demo/). The demo is built referring to the FastChat demo code.

## Install dependencies
```
pip install -e .
```

## Test SoT
The SoT Gradio demo can be started as follows (under the [`demo/`](demo/) directory):

1. Launch the controller
```
python controller.py
```

2. Launch the model workers
```
CUDA_VISIBLE_DEVICES=0 python model_worker.py --model-path ${MODEL_NAME} --controller http://0.0.0.0:21001 --port 31000 --worker http://0.0.0.0:31000
CUDA_VISIBLE_DEVICES=1 python model_worker.py --model-path ${MODEL_NAME} --controller http://0.0.0.0:21001 --port 31001 --worker http://0.0.0.0:31001 --sot ../prompts/sot_opensource.json
```

3. Launch the demo
```
python gradio_web_server_multi.py
```

Then, we can visit the Gradio web URL to try SoT. Note that this demo is currently pure SoT, without support for SoT-R. We'll integrate the support of SoT-R into the demo soon.

## Evaluate SoT
### Prepare the dataset
The data and the pre-processing scripts of Vicuna-80, WizardLM, and LIMA are provided under [`data/`](data/).

### Dump the answers of SoT and Normal decoding
We put the answer dumping scripts for the Vicuna-80 and WizardLM datasets under [`scripts/vicuna/dump/`](scripts/vicuna/dump/) and [`scripts/wizardlm/dump/`](scripts/wizardlm/dump/).

For example, to dump SoT answers of the open-source model named `MODEL_NAME`, we can run
```
bash scripts/vicuna/dump/opensource_outline.sh ${MODEL_NAME} prompts/sot_opensource.json results/vicuna/vicuna_${MODEL_NAME}_outline --num-gpus ${NUM_GPUS}
```
`MODEL_NAME` can be a local path or a Hugging Face endpoint (see Appendix A in the paper for the endpoints we use).

To dump the Normal answer, we can run
```
bash scripts/vicuna/dump/opensource_naive.sh ${MODEL_NAME} none results/vicuna/vicuna_${MODEL_NAME}_naive --num-gpus ${NUM_GPUS}
```

### Evaluate the answer quality
We put the evaluation scripts for the Vicuna-80 and WizardLM datasets under [`scripts/vicuna/dump/`](scripts/vicuna/dump/) and [`scripts/wizardlm/dump/`](scripts/wizardlm/dump/).

The evaluation scripts use the comparison prompts provided by FastChat or LLMZoo to prompt a GPT-4 judge to compare the quality of two answers.
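
As a rough illustration of pairwise judging, the sketch below builds a comparison prompt for each question and tallies the verdicts. Everything in it is an assumption made for illustration: `TEMPLATE` is a hypothetical prompt (the evaluation scripts use the FastChat or LLMZoo prompts instead), and `stub_judge` replaces the GPT-4 judge with a trivial length-based rule so the example runs offline.

```python
from collections import Counter

# Hypothetical template, NOT the FastChat / LLMZoo comparison prompt.
TEMPLATE = (
    "[Question]\n{question}\n\n"
    "[Answer 1]\n{answer_1}\n\n"
    "[Answer 2]\n{answer_2}\n\n"
    "Which answer is better? Reply with exactly one of: 1, 2, tie."
)


def build_comparison_prompt(question: str, answer_1: str, answer_2: str) -> str:
    """Fill the template with one question and the two answers to compare."""
    return TEMPLATE.format(question=question, answer_1=answer_1, answer_2=answer_2)


def stub_judge(prompt: str) -> str:
    """Hypothetical stand-in for the GPT-4 judge: prefers the longer answer."""
    answer_1 = prompt.split("[Answer 1]\n", 1)[1].split("\n\n[Answer 2]", 1)[0]
    answer_2 = prompt.split("[Answer 2]\n", 1)[1].split("\n\nWhich answer", 1)[0]
    if len(answer_1) == len(answer_2):
        return "tie"
    return "1" if len(answer_1) > len(answer_2) else "2"


def tally(records, judge=stub_judge) -> Counter:
    """Count verdicts over (question, sot_answer, normal_answer) triples."""
    labels = {"1": "sot_wins", "2": "normal_wins"}
    verdicts = Counter()
    for question, sot_answer, normal_answer in records:
        verdict = judge(build_comparison_prompt(question, sot_answer, normal_answer))
        # Anything other than "1" or "2" is counted as a tie.
        verdicts[labels.get(verdict.strip(), "tie")] += 1
    return verdicts


if __name__ == "__main__":
    demo = [("q1", "a detailed answer", "short"), ("q2", "a", "bb"), ("q3", "x", "y")]
    print(tally(demo))
```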

### Plot the figures in the paper
Coming soon...

## Develop SoT
### Manually tune the SoT prompts
`sot/prompt_eng_main.py` is a helper program to ease manual prompt tuning. Use `bash scripts/debug_prompt.sh <model name or path>` to run the script. This opens an interactive session in which you can run the following commands:

1. `use <data filepath>` to load data (default: `data/vicuna/data.csv`)
2. `useprompt <prompt filepath>` to change the SoT prompt templates (default: `prompts/sot_opensource.json`)
3. `usenaiveprompt <prompt filepath>` to change the normal prompt template (default to use only the question)
4. `test <ind>` to test SoT decoding for the ind-th question, and `test naive <ind>` to test normal decoding
5. `exit` to exit the session

The model outputs are streamed to the console (note that the expansion of multiple points is conducted sequentially). After a complete test, statistics are printed. At any time during the generation, you can press Ctrl+C to abort the generation and go back to the interactive session.
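
The Ctrl+C behavior follows a common Python pattern: catch `KeyboardInterrupt` around the streaming loop and return control to the caller. The sketch below shows the pattern only and is not taken from `sot/prompt_eng_main.py`; `stream_with_abort` and `interrupted_stream` are hypothetical names.

```python
def stream_with_abort(token_stream, write=print):
    """Consume a token iterator and return (tokens_so_far, aborted)."""
    tokens = []
    try:
        for token in token_stream:
            tokens.append(token)
            write(token)  # stream each token to the console as it arrives
    except KeyboardInterrupt:
        # Ctrl+C pressed: keep the partial output and go back to the session.
        return tokens, True
    return tokens, False


def interrupted_stream():
    """Hypothetical token stream that simulates a Ctrl+C after two tokens."""
    yield "Hello"
    yield "world"
    raise KeyboardInterrupt


if __name__ == "__main__":
    tokens, aborted = stream_with_abort(interrupted_stream())
    print(f"aborted={aborted}, tokens={tokens}")
```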

> Note:
> 1. We mainly use this program to help engineer the prompt for the open-source models.
> 2. Any other command-line arguments for the model can be fed as the arguments to this script. For example, as testing a 13B model on RTX 3090 with FP16 inference requires two GPUs, we can run
> ```bash scripts/debug_prompt.sh meta-llama/Llama-2-13b-chat-hf --num-gpus 2```

### Train the router for SoT-R
Coming soon...

## Acknowledgement
During the development of SoT, we used and referred to the amazing work of [FastChat](https://github.com/lm-sys/FastChat) and the [Hugging Face Transformers package](https://github.com/huggingface/transformers/).
