# Libra: Dynamic Load Balancing with Speculative Expert Prefetching and Optimal Token Assignment

An SGLang-based framework that dynamically balances MoE token loads through speculative expert prefetching and optimal token assignment, minimizing stragglers and improving throughput.

## Table of Contents
- [Overview](#overview)
- [Features](#features)
- [Prerequisites](#prerequisites)
- [Installation](#installation)
- [Usage](#usage)

## Overview

Libra augments SGLang with dynamic MoE load balancing. It predicts expert demand and prefetches experts speculatively, then adaptively shards tokens across GPUs to balance load, which improves throughput and reduces throughput fluctuation.
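
To make "load balancing" concrete, one common way to quantify imbalance is the ratio of the most-loaded GPU's token count to the mean token count per GPU. The sketch below uses that definition as an assumption. It is not necessarily the metric that Libra Internal reports, and `imbalance_ratio` is a hypothetical helper name.

```python
def imbalance_ratio(tokens_per_gpu):
    """Return max load divided by mean load (1.0 means perfectly balanced).

    ASSUMPTION: this max/mean definition is illustrative only and may
    differ from the imbalance ratio measured by Libra Internal.
    """
    total = sum(tokens_per_gpu)
    if total == 0:
        # No tokens at all: treat as balanced to avoid dividing by zero.
        return 1.0
    mean = total / len(tokens_per_gpu)
    return max(tokens_per_gpu) / mean


balanced = imbalance_ratio([100, 100, 100, 100])
skewed = imbalance_ratio([250, 50, 50, 50])
print(balanced, skewed)
```

Under this definition, the slowest GPU in the skewed case handles 2.5 times the average load, which is the kind of straggler that load balancing aims to remove.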

## Features
- **Two-Stage Locality-Aware Execution**: Splits MoE computation into two phases based on token locality
  - MoE<sub>local</sub>: Processes tokens routed to experts residing on the same GPU as the tokens themselves
  - MoE<sub>remote</sub>: Handles tokens that must be dispatched to other GPUs
- **Global-Local Hot Expert Replication**: Introduces an additional refinement to hot expert replication
  - Local hot experts: Extend the MoE<sub>local</sub> computation window, thereby providing more opportunity to hide token sharding overhead
  - Global hot experts: Provide the flexibility to redistribute tokens from overloaded GPUs
- **Token Sharding**: Balances workload by determining the optimal assignment of remote tokens to specific GPUs.
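
The features above can be illustrated with a toy sketch: tokens are first partitioned into a local and a remote set, and remote tokens are then assigned to a GPU that holds a copy of their expert. This is a simplified illustration under stated assumptions, not Libra's implementation: it assumes top-1 routing, uses a greedy least-loaded heuristic in place of the optimal assignment, and all function and variable names are hypothetical.

```python
def split_local_remote(token_gpu, token_expert, expert_gpus):
    """Partition token indices by locality.

    token_gpu[i]    -- GPU on which token i resides
    token_expert[i] -- expert token i is routed to (top-1, an assumption)
    expert_gpus[e]  -- set of GPUs holding a copy of expert e
    """
    local, remote = [], []
    for idx, (gpu, expert) in enumerate(zip(token_gpu, token_expert)):
        if gpu in expert_gpus[expert]:
            local.append(idx)   # handled in the local phase
        else:
            remote.append(idx)  # must be dispatched to another GPU
    return local, remote


def assign_remote_greedy(remote, token_expert, expert_gpus, initial_load):
    """Send each remote token to the least-loaded GPU hosting its expert.

    ASSUMPTION: a greedy heuristic standing in for an optimal assignment.
    """
    load = dict(initial_load)
    assignment = {}
    for idx in remote:
        candidates = sorted(expert_gpus[token_expert[idx]])
        target = min(candidates, key=lambda gpu: load[gpu])
        assignment[idx] = target
        load[target] += 1
    return assignment, load


# Toy setup (hypothetical): 3 GPUs, experts 0-2 on one GPU each,
# expert 3 replicated on GPUs 1 and 2 as a hot expert.
expert_gpus = {0: {0}, 1: {1}, 2: {2}, 3: {1, 2}}
token_gpu = [0, 0, 0, 0, 1, 2]
token_expert = [0, 3, 3, 3, 1, 2]

local, remote = split_local_remote(token_gpu, token_expert, expert_gpus)

# Local tokens are computed on the GPU where they already reside.
initial_load = {gpu: 0 for gpu in range(3)}
for idx in local:
    initial_load[token_gpu[idx]] += 1

assignment, final_load = assign_remote_greedy(
    remote, token_expert, expert_gpus, initial_load
)
print(local, remote)
print(assignment, final_load)
```

In this toy case, tokens 1 to 3 are remote because expert 3 has no copy on GPU 0. Replicating expert 3 on two GPUs lets the assignment spread those tokens across GPUs 1 and 2 instead of sending all of them to a single device.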

## Prerequisites
- Python 3.10
- CUDA 12.6+
- HBM-equipped NVIDIA GPUs
- NVLink or NVSwitch interconnect

## Installation

### 1. Clone SGLang

```bash
# Clone SGLang
git clone https://github.com/sgl-project/sglang.git
cd sglang
git checkout 023288645b80fb41b3eed55fd413dd69a7904593
cd ..

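# Make one copy of the checkout per variant
# (a plain `mv` would leave nothing to rename after the first step)

# For Libra
cp -r sglang sglang_libra

# For Lina
cp -r sglang sglang_lina

# For Libra Internal
mv sglang sglang_libra_internal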

```

### 2. Create Environment and Install
```bash
# Run the install step inside each environment, from the matching checkout.

# Environment setup - Libra
conda create -n libra python=3.10 -y
conda activate libra
(cd sglang_libra && pip install -e "python[all]")

# Environment setup - Lina
conda create -n lina python=3.10 -y
conda activate lina
(cd sglang_lina && pip install -e "python[all]")

# Environment setup - Libra Internal
# Libra Internal is required for imbalance ratio measurement
conda create -n libra_internal python=3.10 -y
conda activate libra_internal
(cd sglang_libra_internal && pip install -e "python[all]")
```

### 3. Setup Libra, Lina, and Libra Internal

```bash
# Setup for Libra
cd sglang_libra
git apply ../sglang_libra.diff
cd ..

# Setup for Lina
cd sglang_lina
git apply ../sglang_lina.diff
cd ..

# Setup for Libra Internal (to calculate imbalance ratio)
cd sglang_libra_internal
git apply ../sglang_libra_internal.diff
cd ..
```

## Usage

### Libra
```bash
bash single_node_scripts/ep_test.sh ${MODEL} ${SEQ_LEN} ${MINI_BATCH_SIZE} EP ${DATASET} train ${START_PORTION} ${END_PORTION} ${START_IDX} ${END_IDX} ${N} ${L} ${SEQ_LENS_SUM} ${SCHEME}

# Example
# bash ./single_node_scripts/ep_test.sh Qwen3-235B-A22B 1024 2 EP bookcorpus train 0 1 800 1000 6 4 2048 libra
```
- MODEL: Model to load and run
- SEQ_LEN: Sequence length
- MINI_BATCH_SIZE: Mini batch size
- DATASET: Dataset to use. Downloading it from Hugging Face may be required.
- START_PORTION: Start portion of the dataset
- END_PORTION: End portion of the dataset
- START_IDX: Start index of the dataset samples used for evaluation
- END_IDX: End index of the dataset samples used for evaluation
- N: The number of prefetched experts per GPU
- L: The number of local hot experts per GPU
- SEQ_LENS_SUM: SEQ_LEN * MINI_BATCH_SIZE
- SCHEME: Scheme to apply
- **CAUTION**: Check the variable is_ori before running the script.
  - For the SGLang and EPLB schemes, is_ori should be True
  - For Libra, is_ori should be False
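
Because SEQ_LENS_SUM must equal SEQ_LEN * MINI_BATCH_SIZE, it can help to derive it rather than type it by hand. The sketch below assembles the argument list for `ep_test.sh` in the positional order shown above. `build_ep_test_command` is a hypothetical helper and not part of the repository.

```python
import shlex


def build_ep_test_command(model, seq_len, mini_batch_size, dataset,
                          start_portion, end_portion, start_idx, end_idx,
                          n, l, scheme,
                          script="single_node_scripts/ep_test.sh"):
    """Build the ep_test.sh command line (hypothetical helper).

    SEQ_LENS_SUM is computed from SEQ_LEN and MINI_BATCH_SIZE so the two
    values cannot drift apart. "EP" and "train" are the fixed positional
    arguments that appear in the documented command.
    """
    seq_lens_sum = seq_len * mini_batch_size
    args = [
        "bash", script, model, seq_len, mini_batch_size, "EP",
        dataset, "train", start_portion, end_portion,
        start_idx, end_idx, n, l, seq_lens_sum, scheme,
    ]
    return [str(a) for a in args]


cmd = build_ep_test_command(
    "Qwen3-235B-A22B", 1024, 2, "bookcorpus", 0, 1, 800, 1000, 6, 4, "libra"
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the directory that contains `single_node_scripts`.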

### Lina
```bash
# Run Qwen3MoE
bash single_node_scripts/ep_test_qwen3.sh ${MODEL} ${SEQ_LEN} ${MINI_BATCH_SIZE} EP ${DATASET} train ${START_PORTION} ${END_PORTION} ${START_IDX} ${END_IDX} ${N} ${SEQ_LENS_SUM}

# Example
# bash ./single_node_scripts/ep_test_qwen3.sh Qwen3-235B-A22B 1024 2 EP bookcorpus train 0 1 800 1000 6 2048

# Run GLM-4.5
bash single_node_scripts/ep_test_glm45.sh ${MODEL} ${SEQ_LEN} ${MINI_BATCH_SIZE} EP ${DATASET} train ${START_PORTION} ${END_PORTION} ${START_IDX} ${END_IDX} ${N} ${SEQ_LENS_SUM}

# Example
# bash ./single_node_scripts/ep_test_glm45.sh GLM-4.5 1024 2 EP bookcorpus train 0 1 800 1000 6 2048
```
- MODEL: Model to load and run
- SEQ_LEN: Sequence length
- MINI_BATCH_SIZE: Mini batch size
- DATASET: Dataset to use. Downloading it from Hugging Face may be required.
- START_PORTION: Start portion of the dataset
- END_PORTION: End portion of the dataset
- START_IDX: Start index of the dataset samples used for evaluation
- END_IDX: End index of the dataset samples used for evaluation
- N: The number of prefetched experts per GPU
- SEQ_LENS_SUM: SEQ_LEN * MINI_BATCH_SIZE
- **CAUTION**: Check the variable is_ori before running the script.
  - For Lina, is_ori should be False