# Explanation for Supplementary Material
## The structure of files
The vllm_partial folder contains the Cronus implementation, except for the frontend.
The vllm_benchmark folder contains the implementation of the frontend and the benchmarks we used.


## Cronus implementation
Cronus is based on a development branch of vLLM 0.6.1.post2.
This branch contains an original implementation of disaggregated prefill for vLLM.
Later, in vLLM 0.7.0, a similar disaggregated prefill implementation by the same author was merged into vLLM as an experimental feature.

We made the following modifications on top of this branch:
- make the KV cache transfer asynchronous
- allow extra fields in the OpenAI entrypoint
- add extra metadata to each request about its partial prefill length (we name it computed_len)
- continue chunked prefill for requests whose partial prefill length (computed_len) does not equal their prompt length (input length)
- make prompt truncation in this implementation cut off the later part of the input
- use TCPStore to transfer balancer data from the chunked prefill instance to the frontend. The environment variables FRONTEND_TCPSTORE_PORT and FRONTEND_TCPSTORE_IP need to be set to enable this feature.
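
As a rough illustration of the TCPStore item, the frontend address could be read from the two environment variables as sketched below. This is not the code in vllm_partial: the helper name frontend_store_address is hypothetical, and treating unset variables as "feature disabled" is an assumption made for the example.

```python
import os
from typing import Mapping, Optional, Tuple


def frontend_store_address(
    env: Mapping[str, str] = os.environ,
) -> Optional[Tuple[str, int]]:
    """Return (ip, port) of the frontend TCPStore, or None if not configured.

    Hypothetical helper: only the variable names come from this README.
    """
    ip = env.get("FRONTEND_TCPSTORE_IP")
    port = env.get("FRONTEND_TCPSTORE_PORT")
    if not ip or not port:
        # Assumed meaning: the balancer data transfer is disabled.
        return None
    return ip, int(port)


# With PyTorch installed, the address could then be used to connect as a
# client, for example: torch.distributed.TCPStore(ip, port, is_master=False)
```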

The computed_len field should only be set when a request is sent to the chunked prefill instance.
When a request is sent to the partial prefill instance, the partial prefill length should be set through the truncate_prompt_tokens field.
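
The rule above can be sketched as two small helpers that derive the request body for each instance from a common base request. This is only an illustration: the helper names are hypothetical, and placing computed_len as a top-level field of the request body (and removing the other field in each case) is an assumption, not a description of the actual frontend.

```python
from typing import Any, Dict


def for_partial_prefill(request: Dict[str, Any], partial_len: int) -> Dict[str, Any]:
    """Body for the partial prefill instance: length goes in truncate_prompt_tokens."""
    body = dict(request)  # copy, so the base request is left untouched
    body.pop("computed_len", None)  # computed_len is only for chunked prefill
    body["truncate_prompt_tokens"] = partial_len
    return body


def for_chunked_prefill(request: Dict[str, Any], partial_len: int) -> Dict[str, Any]:
    """Body for the chunked prefill instance: length goes in computed_len."""
    body = dict(request)
    body.pop("truncate_prompt_tokens", None)  # assumption: full prompt is sent
    body["computed_len"] = partial_len
    return body
```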

The modified vLLM is in the vllm_partial folder.
The frontend for Cronus is in vllm_benchmark/disagg_benchmarks/disagg_prefill_proxy_server_store.py.


There are many workarounds in our implementation, so the code is far from production-ready.
Apart from the vLLM features we used in our benchmarks, we cannot guarantee that features provided by vLLM 0.6.1 work in our implementation.

## Install for Experiments
For the Cronus, DP, disagg HL, and disagg LH experiments:
- Create a conda environment (in our bash script this environment is called vllm-disagg)
- install python3.18 and pip3 through conda
- install socat through conda
- install quart, httpx, and aiohttp through pip3
- use pip3 to install the modified vLLM in the vllm_partial folder. An editable install is preferred.

For the PP experiments:
- Create another conda environment (in our bash script this environment is called vllm-pip6)
- install python3.18 and pip3 through conda
- install socat through conda
- install quart, httpx, and aiohttp through pip3
- use pip3 to install vLLM 0.6.1
- Change the OpenAI protocol to allow extra fields.
    - You need to find where the Python files of vLLM are installed. You can use "pip show vllm" to find the location of the "site-packages" folder. vLLM is installed in the "vllm" folder under the "site-packages" folder.
    - The file you need to change is vllm/entrypoints/openai/protocol.py. Change the string "forbid" in OpenAIBaseModel to "allow".
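
If you prefer to script the last step, the edit can be sketched as a small text transformation. This is a minimal sketch under assumptions: the helper name allow_extra_fields is hypothetical, and it assumes the setting appears as the double-quoted string "forbid" somewhere after the line that starts the OpenAIBaseModel class.

```python
import re


def allow_extra_fields(source: str) -> str:
    """Replace the first "forbid" after `class OpenAIBaseModel` with "allow"."""
    match = re.search(r"^class OpenAIBaseModel\b", source, flags=re.MULTILINE)
    if match is None:
        raise ValueError("class OpenAIBaseModel not found")
    # Leave everything before the class untouched.
    head, tail = source[: match.start()], source[match.start():]
    patched_tail, count = re.subn(r'"forbid"', '"allow"', tail, count=1)
    if count == 0:
        raise ValueError('no "forbid" string found after OpenAIBaseModel')
    return head + patched_tail
```

To apply it, read the text of the protocol file, pass it through allow_extra_fields, and write the result back. Keeping a backup copy of the original file is advisable.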

## Data for Experiments
We use the conversation trace from the Microsoft Azure Trace 2023.
The original trace is in vllm_benchmark/data/AzureLLMInferenceTrace_conv.csv.
In our experiments, we modified the timestamps of the trace so that the time interval between trace entries is fixed.
The modified trace is in AzureConv_rps1.csv.
Later, during the experiments, we scaled the timestamps to achieve different request rates.
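
The two timestamp transformations can be sketched as follows. This is an illustration rather than the script we used: the function names are hypothetical, and the default interval of one second is an assumption suggested by the file name AzureConv_rps1.csv.

```python
from typing import List, Sequence


def fixed_interval_timestamps(num_requests: int, interval_s: float = 1.0) -> List[float]:
    """Timestamps with a constant gap between consecutive requests."""
    return [i * interval_s for i in range(num_requests)]


def scale_timestamps(timestamps: Sequence[float], scale: float) -> List[float]:
    """Multiply every timestamp by `scale`; a smaller scale gives a higher rate."""
    return [t * scale for t in timestamps]
```

For example, scaling a one-request-per-second trace by 0.5 yields a trace with two requests per second.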

## Experiments

We modified the benchmarks in the vLLM repo to run our benchmarks.
The vllm_benchmark folder is the benchmark folder of the vLLM repo with our modifications.

All the bash scripts for our experiments are in vllm_benchmark/disagg_benchmarks.
When you launch these jobs, please make sure you are in this directory.

We ran all our jobs on a SLURM cluster.
In the cluster, nodes with A10 GPUs have the feature "a10", nodes with A30 GPUs have the feature "a30", and nodes with A100-80GB GPUs have the feature "a100-80gb".

## TTFT P99, TBT P99, and Throughput
The bash scripts involved in the TTFT P99, TBT P99, and throughput measurements are:
- run_all.bash
- launch_benchmark_trace.sh
- For DP
    - run_dp_test_\*.bash
    - launch_hetero_chunked1.sh
    - launch_hetero_chunked2.sh
- For PP
    - run_pp_test_\*.bash
    - launch_pp_ray_head.sh
    - launch_pp_ray_worker.sh
- For disaggregated prefill
    - run_hetero_test_HL_\*.bash
    - run_hetero_test_LH_\*.bash
    - launch_hetero_decode_null.sh
    - launch_hetero_prefill.sh
- For Cronus
    - run_hetero_test_\*.bash
    - launch_hetero_decode.sh
    - launch_hetero_prefill.sh

All run_\*.bash scripts, except run_all.bash, are scripts for submitting SLURM jobs.
They should be called using the sbatch command.
You may also want to adjust the feature constraints (i.e. -C) and the account name (i.e. -A) in these scripts to match the names in your SLURM cluster.
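
As an alternative to editing the scripts, sbatch options given on the command line take precedence over the #SBATCH directives inside a script. The sketch below only builds such a command line; the helper name and the script name in the usage note are hypothetical.

```python
from typing import List, Optional


def sbatch_command(
    script: str,
    constraint: Optional[str] = None,
    account: Optional[str] = None,
) -> List[str]:
    """Build an sbatch command that overrides -C and -A when they are given."""
    cmd = ["sbatch"]
    if constraint:
        cmd += ["-C", constraint]  # node feature, e.g. "a100-80gb"
    if account:
        cmd += ["-A", account]  # value for -A on your cluster
    cmd.append(script)
    return cmd
```

The returned list can be passed to subprocess.run, for instance with sbatch_command("job.bash", constraint="a10").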

The script run_all.bash submits all the jobs to the SLURM cluster.

The script launch_benchmark_trace.sh runs the benchmark with different request rates.
All the jobs use this script to run benchmarks, so you can adjust the for loop in this script to change the request rates.
Notice that the variable in the for loop is the time scale, which is the inverse of the request rate.
The output of launch_benchmark_trace.sh will be stored in the last_results folder as the result.

The launch_benchmark_trace.sh in this zip file is set to measure TTFT P99 and TBT P99.
To measure the maximum throughput, please set the scale to a very small number.
In our experiments, we set the scale to 0.0000001.
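
The relation between the time scale and the request rate can be written as a pair of small conversions. This sketch assumes the base trace runs at one request per second, as the file name AzureConv_rps1.csv suggests; the function names are hypothetical.

```python
def scale_for_rate(target_rps: float, base_rps: float = 1.0) -> float:
    """Time scale that turns a trace at `base_rps` into one at `target_rps`."""
    if target_rps <= 0:
        raise ValueError("target_rps must be positive")
    return base_rps / target_rps


def rate_for_scale(scale: float, base_rps: float = 1.0) -> float:
    """Request rate obtained after multiplying timestamps by `scale`."""
    if scale <= 0:
        raise ValueError("scale must be positive")
    return base_rps / scale
```

With a scale as small as 0.0000001, the nominal rate is so high that practically all requests arrive at once, so the measured throughput is limited by the system rather than by the arrival rate.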

If your conda environment names are different from those mentioned in the Install for Experiments section,
you need to adjust the launch_\*.sh scripts accordingly.

## Measure the maximum throughput of the prefill instance and decode instance
In the appendix, to calculate the relative GPU utilization, we measure the throughput of the prefill instance and the decode instance in disaggregated prefill.
### Measure maximum throughput of the prefill instance
To measure the maximum throughput of the prefill instance, we can simply skip the decode instance in the frontend.
To do so, replace "disagg_prefill_proxy_server_store_null.py" in the launch_hetero_decode_null.sh file with "disagg_prefill_proxy_server_store_null_prefill_only.py".
Then, when you run a job with run_hetero_test_HL_\*.bash or run_hetero_test_LH_\*.bash, the throughput measured is the throughput of the prefill instance.
### Measure maximum throughput of the decode instance
Measuring the maximum throughput of the decode instance is more complicated, as we need to change vLLM itself.
We skip all KV cache transfers except the first one, and generate fake KV cache for the decode instance after the first request.

We modified vLLM to do this. You need to use the vLLM implementation in the vllm_decode_xp folder.
If you installed the previous Cronus implementation in editable mode, you can simply replace the contents of vllm_partial with the contents of vllm_decode_xp.

In addition, you need to adjust the hidden_size in vllm_decode_xp/vllm/worker/model_runner.py line 1719 to match the model, so that the fake data generated in decode has the correct shape.
The hidden_size for Llama3-8B is 4096. The hidden_size for Qwen2-7B is 3584.
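
For other models, the value can usually be looked up in the model's Hugging Face config.json, which stores it under the "hidden_size" key for Llama and Qwen2 style models. The sketch below only parses such a config text; the helper name and the lookup table are illustrative.

```python
import json

# Values stated in this README.
KNOWN_HIDDEN_SIZES = {"Llama3-8B": 4096, "Qwen2-7B": 3584}


def hidden_size_from_config_text(config_text: str) -> int:
    """Extract hidden_size from the text of a Hugging Face style config.json."""
    config = json.loads(config_text)
    return int(config["hidden_size"])
```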

Then you need to replace "disagg_prefill_proxy_server_store_null.py" in the launch_hetero_decode_null.sh file with "disagg_prefill_proxy_server_store_null_decode_only.py".

When you launch a job with run_hetero_test_HL_\*.bash or run_hetero_test_LH_\*.bash, you also need to change the benchmark_name from trace to trace_measure_decode.

You will get two results in each \*.log file in the last_result folder; the throughput of the second result is the throughput of the decode instance.

## Profiling prefill execution time and chunked prefill iteration execution time
In the paper, we mention that we model the partial prefill time and the chunked prefill iteration execution time using profiling data.
To do this, we added extra logging to vLLM to measure the corresponding times.
The vLLM with extra logging is in the "vllm_profile" folder.
The log is written to "metrics.log" if the environment variable METRIC_LOG is not set.
Otherwise, the log is written to the file named in METRIC_LOG.

In the log file, each line is a JSON object with data from one iteration.
- "num_tokens" is the number of tokens batched in this iteration.
- "num_prefill" is the number of prefill requests.
- "num_request" is the total number of requests.
- the number of decode requests can be calculated as num_request - num_prefill
- "ctx_prefill" is the sum of the context lengths of all the prefill requests.
- "ctx_decode" is the sum of the context lengths of all the decode requests.

When we calculate the parameters of the time estimation model, we use only the data whose num_tokens equals the maximum number of batched tokens and whose num_prefill equals 1, as this is the common scenario.
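
The selection described above can be sketched as a small filter over the log lines. This is an illustration, not our analysis script: the function names are hypothetical, and the name of the timing field is not listed in this README, so the sketch leaves every other field of each record untouched.

```python
import json
import os
from typing import Any, Dict, Iterable, List, Mapping


def metric_log_path(env: Mapping[str, str] = os.environ) -> str:
    """Log file name: METRIC_LOG if set, otherwise metrics.log."""
    return env.get("METRIC_LOG", "metrics.log")


def select_profile_records(
    lines: Iterable[str], max_batch_tokens: int
) -> List[Dict[str, Any]]:
    """Keep iterations with a full token batch and exactly one prefill request."""
    selected = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if record["num_tokens"] == max_batch_tokens and record["num_prefill"] == 1:
            # Derived field: decode requests = total requests - prefill requests.
            record["num_decode"] = record["num_request"] - record["num_prefill"]
            selected.append(record)
    return selected
```

A typical use would be to open the file returned by metric_log_path() and pass the file object to select_profile_records together with the maximum number of batched tokens configured for the run.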

Example for measuring the prefill execution time:
run bash launch_only_prefill.sh in vllm_benchmark/disagg_benchmarks,
and then run bash launch_prefill_benchmark.sh.

Example for measuring the chunked prefill iteration execution time:
run bash launch_only_chunked.sh in vllm_benchmark/disagg_benchmarks,
and then run bash launch_benchmark_batch2.sh.

