# DualMap

## Prerequisites

- **Hardware**: ≥ 8 GPUs (for full-scale experiments)
- **Software**: Python 3.10+, PyTorch, vLLM, HuggingFace `transformers`

## Installation

```bash
cd flex_attention_vllm
pip install .
```

## Data Preparation

Process the Mooncake open-source dataset:

```bash
python process_dataset.py  # Processed data saved to ./dataset
```

## Launch vLLM Servers

Start one `Qwen/Qwen2.5-7B-Instruct` replica per port (the example below uses eight ports, 8081-8088):

```bash
# Start replica 1 (port 8081)
python -u -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-7B-Instruct \
    --max-num-seqs=256 \
    --max-model-len=20480 \
    --max-num-batched-tokens=20480 \
    --dtype=float16 \
    --tensor-parallel-size=1 \
    --block-size=128 \
    --host=0.0.0.0 \
    --port=8081 \
    --gpu-memory-utilization=0.9 \
    --trust-remote-code \
    --served-model-name "qwen2.5-7B-Instruct"
# Repeat the above command for ports 8082-8088, binding each replica to a different GPU
# (for example by setting CUDA_VISIBLE_DEVICES before each command)
```
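Typing the launch command eight times is error-prone, so the per-replica commands can be generated instead. The following is a minimal sketch, not part of this repository: the helper name `build_replica_commands` is hypothetical, and it assumes one replica per GPU with GPU indices starting at 0, selected through `CUDA_VISIBLE_DEVICES`. It only prints the commands so they can be reviewed before anything is started.

```python
import shlex

MODEL = "Qwen/Qwen2.5-7B-Instruct"


def build_replica_commands(base_port=8081, num_replicas=8):
    """Return one (env, argv) pair per replica.

    Assumption: replica i runs on GPU i and listens on base_port + i.
    """
    commands = []
    for gpu_index in range(num_replicas):
        # Restrict each replica to a single GPU.
        env = {"CUDA_VISIBLE_DEVICES": str(gpu_index)}
        argv = [
            "python", "-u", "-m", "vllm.entrypoints.openai.api_server",
            "--model", MODEL,
            "--max-num-seqs=256",
            "--max-model-len=20480",
            "--max-num-batched-tokens=20480",
            "--dtype=float16",
            "--tensor-parallel-size=1",
            "--block-size=128",
            "--host=0.0.0.0",
            f"--port={base_port + gpu_index}",
            "--gpu-memory-utilization=0.9",
            "--trust-remote-code",
            "--served-model-name", "qwen2.5-7B-Instruct",
        ]
        commands.append((env, argv))
    return commands


if __name__ == "__main__":
    # Print the commands instead of running them, so they can be checked first.
    for env, argv in build_replica_commands():
        prefix = " ".join(f"{key}={value}" for key, value in env.items())
        print(prefix, shlex.join(argv))
```

To actually start a replica from Python, each pair could be passed to `subprocess.Popen(argv, env={**os.environ, **env})`; whether that fits your cluster setup depends on how GPUs are allocated.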

## Run Experiments

Set `replicas_ip_port` in `start_exp.py` to the addresses of the running replicas:

```python
replicas_ip_port = "127.0.0.1:8081,127.0.0.1:8082,127.0.0.1:8083,127.0.0.1:8084,127.0.0.1:8085,127.0.0.1:8086,127.0.0.1:8087,127.0.0.1:8088"
```
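For a different host or replica count, the comma-separated address string can be built rather than typed by hand. This is a small sketch; the helper name `build_replicas_ip_port` is hypothetical, and it assumes all replicas share one host and use consecutive ports.

```python
def build_replicas_ip_port(host="127.0.0.1", base_port=8081, num_replicas=8):
    """Join host:port pairs for consecutive ports into one comma-separated string."""
    return ",".join(f"{host}:{base_port + i}" for i in range(num_replicas))


# With the defaults this reproduces the eight local addresses shown above.
replicas_ip_port = build_replicas_ip_port()
```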

Start the experiment:

```bash
python start_exp.py
```



Experiment outputs are saved in the `./result` directory.

## Important Note

The `dh_rebalance_thredhold` parameter used by SLO-aware request routing **depends on the compute capacity of the cluster instances**. Tune this value to match the performance of your actual hardware.