# MAPS: Memory-Aware Predictive Scheduling Framework for Large Language Model Serving

## Overview

MAPS (Memory-Aware Predictive Scheduling) is a predictive scheduling framework for large language model (LLM) serving, built on vLLM 0.13.0. It reduces serving latency by combining output length prediction with memory-aware scheduling.
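
To illustrate why a predicted output length is useful to a scheduler, the following minimal sketch orders queued requests shortest-job-first by their predicted length. It is an illustration only, not the MAPS scheduler: the `SJFQueue` and `QueuedRequest` names are hypothetical, and memory awareness is left out entirely.

```python
import heapq
import itertools
from dataclasses import dataclass, field


@dataclass(order=True)
class QueuedRequest:
    # Ordering key: shorter predicted outputs are served first.
    predicted_output_length: int
    # Arrival counter breaks ties so equal predictions stay first-come, first-served.
    seq: int
    request_id: str = field(compare=False)


class SJFQueue:
    """Hypothetical shortest-job-first queue keyed on predicted output length."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()

    def push(self, request_id, predicted_output_length):
        item = QueuedRequest(predicted_output_length, next(self._counter), request_id)
        heapq.heappush(self._heap, item)

    def pop(self):
        return heapq.heappop(self._heap).request_id

    def __len__(self):
        return len(self._heap)


queue = SJFQueue()
queue.push("a", 500)
queue.push("b", 50)
queue.push("c", 200)
order = [queue.pop() for _ in range(len(queue))]
print(order)  # ['b', 'c', 'a']
```

Serving the short request `b` first keeps it from waiting behind the long request `a`, which is the usual motivation for shortest-job-first policies.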

## Requirements

- Python 3.12
- vLLM 0.13.0
- Redis (default port: 6379)
- CUDA (if using GPU)

## Installation

### 1. Install Dependencies

```bash
# Install vLLM 0.13.0
pip install vllm==0.13.0

# Install Redis
# Ubuntu/Debian
sudo apt-get install redis-server

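# Start Redis on the default port 6379 (runs in the foreground).
# Skip this step if the package has already started Redis as a system service.
redis-server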
```

### 2. File Replacement Instructions

Because of OpenReview upload limitations, we cannot provide the complete source code. Instead, replace the following files in your vLLM 0.13.0 installation with the provided versions. Paths are relative to the directory that contains the `vllm` package:

#### Files to Replace:

1. **vllm/distributed/kv_transfer/kv_connector/v1/mooncake_connector.py**
   - Replace with the provided `mooncake_connector.py`

2. **vllm/v1/engine/core.py**
   - Replace with the provided `core.py`

3. **vllm/v1/request.py**
   - Replace with the provided `request.py`

4. **vllm/entrypoints/openai/protocol.py**
   - Replace with the provided `protocol.py`

5. **vllm/entrypoints/openai/serving_engine.py**
   - Replace with the provided `serving_engine.py`

6. **vllm/v1/engine/__init__.py**
   - Replace with the provided `__init__.py`

7. **vllm/entrypoints/openai/serving_completion.py**
   - Replace with the provided `serving_completion.py`

8. **vllm/config/scheduler.py**
   - Replace with the provided `scheduler.py`

9. **vllm/v1/core/sched/request_queue.py**
   - Replace with the provided `request_queue.py`
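
Copying nine files by hand is error-prone, so the replacement can be scripted. The sketch below assumes the provided files sit together in one directory under the basenames listed above; the function name `install_replacements` and the `.orig` backup suffix are our own choices, not part of MAPS or vLLM.

```python
import shutil
from pathlib import Path

# Destination paths relative to the vLLM package root, keyed by the
# basename of the provided file (taken from the list above).
REPLACEMENTS = {
    "mooncake_connector.py": "distributed/kv_transfer/kv_connector/v1/mooncake_connector.py",
    "core.py": "v1/engine/core.py",
    "request.py": "v1/request.py",
    "protocol.py": "entrypoints/openai/protocol.py",
    "serving_engine.py": "entrypoints/openai/serving_engine.py",
    "__init__.py": "v1/engine/__init__.py",
    "serving_completion.py": "entrypoints/openai/serving_completion.py",
    "scheduler.py": "config/scheduler.py",
    "request_queue.py": "v1/core/sched/request_queue.py",
}


def install_replacements(src_dir, vllm_root, backup_suffix=".orig"):
    """Copy each provided file over its vLLM counterpart, keeping one backup."""
    src_dir, vllm_root = Path(src_dir), Path(vllm_root)
    # Validate everything first so a failure never leaves a half-patched install.
    missing = [n for n in REPLACEMENTS if not (src_dir / n).is_file()]
    if missing:
        raise FileNotFoundError(f"missing replacement files: {missing}")
    bad_dirs = [r for r in REPLACEMENTS.values() if not (vllm_root / r).parent.is_dir()]
    if bad_dirs:
        raise FileNotFoundError(f"unexpected vLLM layout, no directory for: {bad_dirs}")

    replaced = []
    for name, rel in REPLACEMENTS.items():
        dest = vllm_root / rel
        backup = dest.with_name(dest.name + backup_suffix)
        # Back up the original only once so reruns do not overwrite it.
        if dest.exists() and not backup.exists():
            shutil.copy2(dest, backup)
        shutil.copy2(src_dir / name, dest)
        replaced.append(dest)
    return replaced
```

The package root to pass as `vllm_root` can be printed with `python -c "import vllm, os; print(os.path.dirname(vllm.__file__))"`. A call such as `install_replacements("./maps_files", root)` would then patch the installation, where `./maps_files` is a hypothetical directory holding the provided files.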


## Configuration

### Redis Configuration

Ensure the Redis service is running on the default port 6379:

```bash
# Check if Redis is running
redis-cli ping
# Should return: PONG
```
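
The same check can be made from Python without extra dependencies, for example in a launch script that should refuse to start instances when Redis is down. This is a minimal sketch using a raw socket and Redis's inline `PING` command; the function name `redis_is_up` is hypothetical, and a password-protected server would answer with an error and be reported as down.

```python
import socket


def redis_is_up(host="127.0.0.1", port=6379, timeout=2.0):
    """Return True if a server at host:port answers PING with +PONG."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"PING\r\n")  # inline command form of the Redis protocol
            reply = sock.recv(64)
    except OSError:
        # Covers connection refused, unreachable hosts, and timeouts.
        return False
    return reply.startswith(b"+PONG")
```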


## Usage Examples

### Starting the Service

MAPS uses a disaggregated architecture with separate Prefill and Decode instances. Replace `<model_path>` with your actual model path.

#### Prefill Instance

```bash
vllm serve <model_path> \
  --host 0.0.0.0 \
  --port 8100 \
  --scheduling-policy sjf \
  --kv-transfer-config '{
    "kv_connector": "MooncakeConnector",
    "kv_role": "kv_producer",
    "engine_id": "prefill",
    "kv_connector_extra_config": {
      "enable_intelligent_routing": true,
      "redis_host": "127.0.0.1",
      "redis_port": 6379,
      "num_workers": 10
    }
  }'
```

#### Decode Instance 1

```bash
vllm serve <model_path> \
  --host 0.0.0.0 \
  --port 8200 \
  --scheduling-policy sjf \
  --kv-transfer-config '{
    "kv_connector": "MooncakeConnector",
    "kv_role": "kv_consumer",
    "engine_id": "decode1",
    "kv_connector_extra_config": {
      "enable_capacity_reporting": true,
      "redis_host": "127.0.0.1",
      "redis_port": 6379,
      "decode_id": "0.0.0.0:8200",
      "capacity_report_interval": 1.0,
      "num_workers": 10
    }
  }'
```

#### Decode Instance 2

```bash
vllm serve <model_path> \
  --host 0.0.0.0 \
  --port 8300 \
  --scheduling-policy sjf \
  --kv-transfer-config '{
    "kv_connector": "MooncakeConnector",
    "kv_role": "kv_consumer",
    "engine_id": "decode2",
    "kv_connector_extra_config": {
      "enable_capacity_reporting": true,
      "redis_host": "127.0.0.1",
      "redis_port": 6379,
      "decode_id": "0.0.0.0:8300",
      "capacity_report_interval": 1.0,
      "num_workers": 10
    }
  }'
```
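
The three commands above differ only in port, role, engine ID, and a few extra keys, so they can be generated rather than copied. The sketch below rebuilds the same `--kv-transfer-config` JSON from those inputs; the helper names `kv_transfer_config` and `serve_command` are hypothetical, and only keys that appear in the commands above are used.

```python
import json
import shlex


def kv_transfer_config(role, engine_id, port=None, host="0.0.0.0",
                       redis_host="127.0.0.1", redis_port=6379,
                       num_workers=10, report_interval=1.0):
    """Build the kv-transfer config dict for a Prefill or Decode instance."""
    extra = {
        "redis_host": redis_host,
        "redis_port": redis_port,
        "num_workers": num_workers,
    }
    if role == "kv_producer":  # Prefill instance
        extra["enable_intelligent_routing"] = True
    elif role == "kv_consumer":  # Decode instance
        if port is None:
            raise ValueError("a Decode instance needs a port for its decode_id")
        extra["enable_capacity_reporting"] = True
        extra["decode_id"] = f"{host}:{port}"
        extra["capacity_report_interval"] = report_interval
    else:
        raise ValueError(f"unknown kv_role: {role!r}")
    return {
        "kv_connector": "MooncakeConnector",
        "kv_role": role,
        "engine_id": engine_id,
        "kv_connector_extra_config": extra,
    }


def serve_command(model_path, port, role, engine_id):
    """Return a shell-quoted `vllm serve` command line for one instance."""
    config = kv_transfer_config(role, engine_id, port)
    args = [
        "vllm", "serve", model_path,
        "--host", "0.0.0.0",
        "--port", str(port),
        "--scheduling-policy", "sjf",
        "--kv-transfer-config", json.dumps(config),
    ]
    return shlex.join(args)


# Print one command per instance; substitute the real model path before running.
for port, role, engine_id in [(8100, "kv_producer", "prefill"),
                              (8200, "kv_consumer", "decode1"),
                              (8300, "kv_consumer", "decode2")]:
    print(serve_command("<model_path>", port, role, engine_id))
```

Generating the commands from one table keeps each `decode_id` in sync with its `--port`, which is the easiest detail to get wrong when adding more Decode instances.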

### API Call Example

Send requests to the Prefill instance (port 8100):

```python
import openai

client = openai.OpenAI(
    base_url="http://localhost:8100/v1",
    api_key="dummy"
)

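# Request with a predicted output length.
# `predicted_output_length` is a MAPS extension, not a standard OpenAI
# parameter, so the OpenAI client must send it through `extra_body`.
completion = client.completions.create(
    model="your-model",  # must match the served model name
    prompt="Explain quantum computing",
    max_tokens=500,
    temperature=0.7,
    extra_body={"predicted_output_length": 200},
)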

print(completion.choices[0].text)
```
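
For clients that do not use the OpenAI SDK, it may help to see the request at the HTTP level. The sketch below builds the equivalent `POST /v1/completions` request with the standard library; it assumes the patched `protocol.py` accepts `predicted_output_length` as a top-level JSON field, and the helper name `build_completion_request` is hypothetical.

```python
import json
import urllib.request


def build_completion_request(prompt, predicted_output_length,
                             base_url="http://localhost:8100/v1",
                             model="your-model", max_tokens=500,
                             temperature=0.7):
    """Build (but do not send) a completion request for the Prefill instance."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
        # MAPS-specific field, assumed to sit next to the standard ones.
        "predicted_output_length": predicted_output_length,
    }
    return urllib.request.Request(
        f"{base_url}/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer dummy",
        },
        method="POST",
    )


request = build_completion_request("Explain quantum computing", 200)

# Sending the request requires a running Prefill instance:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response)["choices"][0]["text"])
```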
