# Untied-Ulysses

An efficient implementation of **Untied Ulysses** — a sequence parallelism method for training LLMs on extremely long contexts (up to 5M+ tokens).


## Installation

```bash
# From repo root
uv venv && source .venv/bin/activate

cd torchtitan
uv pip install torch==2.9
uv pip install flash-attn==2.8.3 --no-build-isolation --no-cache
uv sync --no-build-isolation --active

cd ../long-context-attention
uv pip install -e .
```

**Optional: FlashAttention 3 (Hopper GPUs)**

Required to reproduce exact tokens/second numbers from the paper.

```bash
git clone https://github.com/Dao-AILab/flash-attention
cd flash-attention/hopper
uv pip install -U setuptools wheel packaging ninja
MAX_JOBS=128 uv pip install . --no-build-isolation
```

## Usage

### Scripts

Scripts to reproduce Table 3 and Figure 4 from the paper are in `torchtitan/exps/llama3_8b/` and `torchtitan/exps/llama3_70b/`.

| Model | Script | Description |
|-------|--------|-------------|
| 8B | `run_single.sh` | Single experiment |
| 8B | `run_all.sh` | Full benchmark (128K-5M) |
| 8B | `run_single_multinode.sh` | Single experiment (2 nodes) |
| 8B | `run_all_multinode.sh` | Full benchmark (512K-9M, 2 nodes) |
| 70B | `run_single_70B.sh` | Single experiment |
| 70B | `run_all_70B.sh` | Full benchmark (128K-1M) |


### Quick Start

```bash
cd torchtitan/exps/llama3_8b

# Run single experiment
CONTEXT_LENGTH=131072 bash run_single.sh

# Run all experiments
bash run_all.sh
```

### Multinode (8B, 2 nodes)

Run the same command on **both nodes** simultaneously:

```bash
cd torchtitan/exps/llama3_8b

# Run single experiment
NNODES=2 RDZV_ENDPOINT="master-hostname:29500" bash run_single_multinode.sh

# Run all experiments
NNODES=2 RDZV_ENDPOINT="master-hostname:29500" bash run_all_multinode.sh
```

### Configuration

| Variable | Default | Description |
|----------|---------|-------------|
| `CONTEXT_LENGTH` | 2M (8B) / 128K (70B) | Sequence length |
| `STEPS` | 10 | Training steps |
| `BATCH_SIZE` | 1 | Batch size |
| `NNODES` | 1 | Number of nodes |
| `NPROC_PER_NODE` | 8 | GPUs per node |
| `RDZV_ENDPOINT` | localhost:29500 | Master node (multinode only) |

### Attention Methods

| `attn_impl` | Description |
|-------------|-------------|
| `upipe_fa2_offload_tiled_mlp` | **Untied Ulysses (ours)** |
| `usp_fa2_offload_tiled_mlp` | USP zigzag baseline |
| `torch_ring_alltoall` | PyTorch baseline |

For Hopper GPUs, use `_fa3_` variants instead of `_fa2_`.
