# Artifact Reproduction Guide for SpareTrain

This document describes the procedure for reproducing our experimental results.  
Our implementation builds on the PyTorch source code and the TorchTitan repository.  
Because the full codebase is large, we provide only the base git commit hash (e.g., `6f23f53599629a47d6e097b2a027048658a142d4`) along with the corresponding diff files in the supplementary material.  
Following the *Build from Source* instructions in `README.md` enables faithful reproduction of our environment.

## Evaluated Setup

### GPU Configuration
| Specification | Value |
|---------------|-------|
| **Model** | NVIDIA H200 |
| **Count per Node** | 8 |
| **Memory per GPU** | 141 GB (143,771 MiB) |
| **Interconnect** | NVLink, 18 links per GPU (reported as NV18 by `nvidia-smi topo -m`) |
| **Driver Version** | 560.35.05 |
| **CUDA Version** | 12.6 |

### System Configuration
- **OS**: Ubuntu 22.04.4 LTS
- **Python**: 3.10.12
- **Multi-node**: 2-4 nodes connected via InfiniBand
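
Before building, it can help to compare a machine against the values listed above. The following is a minimal sketch, not part of the artifact: `EXPECTED` and `find_mismatches` are hypothetical names, and only the Python version is detected automatically. The remaining values are assumed to be entered by hand, for example after reading them from `nvidia-smi`.

```python
import platform

# Values copied from the "Evaluated Setup" section above.
EXPECTED = {
    "python": "3.10.12",
    "cuda": "12.6",
    "driver": "560.35.05",
    "gpus_per_node": "8",
}


def find_mismatches(observed: dict[str, str]) -> list[str]:
    """Return one message per key whose observed value differs from EXPECTED."""
    messages = []
    for key, expected in EXPECTED.items():
        actual = observed.get(key, "<not reported>")
        if actual != expected:
            messages.append(f"{key}: expected {expected}, found {actual}")
    return messages


if __name__ == "__main__":
    # Only the Python version is read automatically; the other keys are
    # reported as missing until they are filled in manually.
    observed = {"python": platform.python_version()}
    for line in find_mismatches(observed):
        print(line)
```

A mismatch does not necessarily prevent reproduction; it only flags a difference from the evaluated setup.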

## Prerequisites

- Custom PyTorch and TorchTitan builds (see `README.md` for setup)
- Required patch files: `pytorch.diff`, `torchtitan.diff`
- HuggingFace token for model and tokenizer access

## Available Configuration Files

**Default Settings**: All configurations use TP=8 (or EP=8 for MoE), PP=4, and a sequence length of 8K unless otherwise specified.
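
These defaults imply a simple GPU budget. The sketch below assumes that the TP (or EP) group fills one 8-GPU node and that no extra data-parallel dimension is used (`dp=1`); neither assumption is stated explicitly in this guide, and `nodes_required` is a hypothetical helper.

```python
GPUS_PER_NODE = 8  # from the GPU configuration table


def nodes_required(tp: int, pp: int, dp: int = 1) -> int:
    """Number of nodes needed for a tp x pp x dp layout, rounding up."""
    total_gpus = tp * pp * dp
    return -(-total_gpus // GPUS_PER_NODE)  # ceiling division


for pp in (2, 3, 4):
    print(f"TP=8, PP={pp}: {8 * pp} GPUs, {nodes_required(8, pp)} node(s)")
```

Under these assumptions the node count equals the pipeline-parallel degree, which is consistent with the 2-4 node range listed under System Configuration.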

### Llama3 Configurations (`torchtitan/models/llama3/train_configs/`)
- **70b_pp2.toml**: Llama3 70B with Pipeline Parallelism degree 2 (for main eval)
- **70b_pp3.toml**: Llama3 70B with Pipeline Parallelism degree 3 (for main eval)
- **mistral.toml**: Mistral model with Pipeline Parallelism degree 4 (for main eval and ablation study)
- **mistral_pp3.toml**: Mistral with Pipeline Parallelism degree 3 (for main eval)
- **mistral_4k.toml**: Mistral with Pipeline Parallelism degree 4 and a 4K sequence length (for sensitivity analysis)
- **mistral_16k.toml**: Mistral with Pipeline Parallelism degree 4 and a 16K sequence length (for sensitivity analysis)
- **mistral_async.toml**: Mistral with Pipeline Parallelism degree 4 and async communication (for sensitivity analysis)

### Llama4 MoE Configurations (`torchtitan/experiments/llama4/train_configs/`)
- **16e.toml**: Llama4 Scout 17Bx16E standard configuration (not used)
- **16e_load.toml**: Llama4 Scout 17Bx16E with pretrained weight loading (for main eval and ablation study)

## Step 1: Prepare MoE Model Weights (Required Only for MoE Experiments)

For MoE models, convert the HuggingFace weights to the PyTorch Distributed Checkpoint (DCP) format:

### Configure TOML File
Set up checkpoint settings in your TOML file:
```toml
[checkpoint]
enable_checkpoint = true
folder = "checkpoint"
dump_folder = "/path/to/your/base/directory"
```

### Run Weight Conversion
```bash
cd torchtitan/experiments/llama4/scripts

./convert_hf_to_dcp_with_gpus.sh \
  --checkpoint.enable_checkpoint \
  --checkpoint.convert_path "/path/to/huggingface/model/snapshots/snapshot_id"
```

**Example for Llama-4-Scout:**
```bash
./convert_hf_to_dcp_with_gpus.sh \
  --checkpoint.enable_checkpoint \
  --checkpoint.convert_path "/root/filesystem/DMR/models--meta-llama--Llama-4-Scout-17B-16E/snapshots/14d516bdff6ac06cec40678529222f193386189c"
```
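
The `--checkpoint.convert_path` argument points at a snapshot directory inside a HuggingFace cache, which follows the layout `models--<org>--<name>/snapshots/<snapshot_id>`. The sketch below lists the snapshot directories available for a model so that the path can be copied rather than typed. `cache_folder_name` and `list_snapshots` are hypothetical helpers, and the cache root is whatever directory the model was downloaded into.

```python
from pathlib import Path


def cache_folder_name(repo_id: str) -> str:
    """Map 'org/name' to the folder name used in the HuggingFace cache."""
    return "models--" + repo_id.replace("/", "--")


def list_snapshots(cache_root: Path, repo_id: str) -> list[Path]:
    """Return the snapshot directories for repo_id, or [] if none are cached."""
    snapshots_dir = cache_root / cache_folder_name(repo_id) / "snapshots"
    if not snapshots_dir.is_dir():
        return []
    return sorted(p for p in snapshots_dir.iterdir() if p.is_dir())
```

For the example above, `list_snapshots(Path("/root/filesystem/DMR"), "meta-llama/Llama-4-Scout-17B-16E")` would return the candidate paths for `--checkpoint.convert_path`, provided the model has been downloaded there.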

## Step 2: Generate Experiment Configurations

### Example 1: Llama3 70B Experiments (llama_pp3)
Using `70b_pp3.toml` with batch sizes 8, 16, 32 and max_memory 80GB:

```bash
cd torchtitan
python make_toml.py \
  --toml ./torchtitan/models/llama3/train_configs/70b_pp3.toml \
  --bsz 8 16 32 \
  --max_memory 80 \
  --mbs 1
```

### Example 2: MoE Model Experiments (16e_load)
Using `16e_load.toml` with batch sizes 16, 32, 64 and max_memory 141GB:

```bash
cd torchtitan
python make_toml.py \
  --toml ./torchtitan/experiments/llama4/train_configs/16e_load.toml \
  --bsz 16 32 64 \
  --moe \
  --max_memory 141 \
  --mbs 1
```

### Generated Experiment Types

#### vanilla_tasks/
Baseline configurations. For each memory setting, tasks alternate between vanilla (DMR disabled) and naive (basic DMR enabled):

**For Dense Models (Llama3):**
- **task1.toml**: Vanilla (DMR disabled), Memory budget 0.0
- **task2.toml**: Naive (basic DMR enabled), Memory budget 0.0
- **task3.toml**: Vanilla (DMR disabled), Memory budget 0.25
- **task4.toml**: Naive (basic DMR enabled), Memory budget 0.25
- **task5.toml**: Vanilla (DMR disabled), Memory budget 0.5
- **task6.toml**: Naive (basic DMR enabled), Memory budget 0.5
- **task7.toml**: Vanilla (DMR disabled), Memory budget 0.75
- **task8.toml**: Naive (basic DMR enabled), Memory budget 0.75
- **task9.toml**: Vanilla (DMR disabled), Memory budget 1.0
- **task10.toml**: Naive (basic DMR enabled), Memory budget 1.0

**For MoE Models (Llama4):**
- **task1.toml**: Vanilla (DMR disabled), Full recompute
- **task2.toml**: Naive (basic DMR enabled), Full recompute
- **task3.toml**: Vanilla (DMR disabled), Selective storage with 0.0 budget
- **task4.toml**: Naive (basic DMR enabled), Selective storage with 0.0 budget
- **task5.toml**: Vanilla (DMR disabled), Selective storage with 1.0 budget
- **task6.toml**: Naive (basic DMR enabled), Selective storage with 1.0 budget

#### our_tasks/
Full DMR implementation (phase_1+2+3):

**For Dense Models (Llama3):**
- **task1.toml**: Memory budget 0.0 (full recompute)
- **task2.toml**: Memory budget 0.25
- **task3.toml**: Memory budget 0.5
- **task4.toml**: Memory budget 0.75
- **task5.toml**: Memory budget 1.0

**For MoE Models (Llama4):**
- **task1.toml**: Full recompute (all operations use full recomputation)
- **task2.toml**: Selective storage with 0.0 budget
- **task3.toml**: Selective storage with 1.0 budget

#### ablation_tasks/
Ablation configurations. For each memory setting, tasks alternate between Phase 1 only and Phase 1 + 2:

**For Dense Models (Llama3):**
- **task1.toml**: Phase 1 only, Memory budget 0.0
- **task2.toml**: Phase 1 + 2, Memory budget 0.0
- **task3.toml**: Phase 1 only, Memory budget 0.25
- **task4.toml**: Phase 1 + 2, Memory budget 0.25
- **task5.toml**: Phase 1 only, Memory budget 0.5
- **task6.toml**: Phase 1 + 2, Memory budget 0.5
- **task7.toml**: Phase 1 only, Memory budget 0.75
- **task8.toml**: Phase 1 + 2, Memory budget 0.75
- **task9.toml**: Phase 1 only, Memory budget 1.0
- **task10.toml**: Phase 1 + 2, Memory budget 1.0

**For MoE Models (Llama4):**
- **task1.toml**: Phase 1 only, Full recompute
- **task2.toml**: Phase 1 + 2, Full recompute
- **task3.toml**: Phase 1 only, Selective storage with 0.0 budget
- **task4.toml**: Phase 1 + 2, Selective storage with 0.0 budget
- **task5.toml**: Phase 1 only, Selective storage with 1.0 budget
- **task6.toml**: Phase 1 + 2, Selective storage with 1.0 budget
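
The numbering in all three directories follows one pattern: the memory setting advances once every variant has been listed for it. The sketch below reproduces that mapping. It is inferred only from the lists above; `describe_task` and its lookup tables are hypothetical and are not taken from `make_toml.py`.

```python
# Memory settings in the order used by the lists above.
DENSE_SETTINGS = [f"Memory budget {b}" for b in ("0.0", "0.25", "0.5", "0.75", "1.0")]
MOE_SETTINGS = [
    "Full recompute",
    "Selective storage with 0.0 budget",
    "Selective storage with 1.0 budget",
]

# Variants that alternate within each directory.
VARIANTS = {
    "vanilla_tasks": ["Vanilla (DMR disabled)", "Naive (basic DMR enabled)"],
    "our_tasks": ["Full DMR (phase_1+2+3)"],
    "ablation_tasks": ["Phase 1 only", "Phase 1 + 2"],
}


def describe_task(family: str, index: int, moe: bool = False) -> tuple[str, str]:
    """Return (variant, memory setting) for task<index>.toml in a directory."""
    variants = VARIANTS[family]
    settings = MOE_SETTINGS if moe else DENSE_SETTINGS
    if not 1 <= index <= len(variants) * len(settings):
        raise ValueError(f"{family} has no task{index}.toml")
    # Variants cycle fastest; the memory setting changes after each full cycle.
    setting_idx, variant_idx = divmod(index - 1, len(variants))
    return variants[variant_idx], settings[setting_idx]
```

For example, `describe_task("vanilla_tasks", 4)` yields the naive variant with memory budget 0.25, matching `task4.toml` in the dense list.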

> **Note on MoE Memory Management**: Because TorchTitan did not originally support `torch.compile` for MoE layers, we enable compilation only for the attention layers and keep the MoE layers recompute-only. The MoE configurations vary the MoE recompute behavior and the attention compilation settings to obtain different memory usage levels.


### Output Structure
```
70b_pp3/
├── bsz_8/
│   └── mbs_1/
│       └── max_80/
│           ├── vanilla_tasks/
│           ├── our_tasks/
│           └── ablation_tasks/
├── bsz_16/
└── bsz_32/

16e_load/
├── bsz_16/
├── bsz_32/
└── bsz_64/
```
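
The directory names are derived from the `make_toml.py` arguments. The sketch below enumerates the task directories one would expect for a given invocation, which is useful for checking that generation completed. It assumes that every `bsz_*` directory has the same substructure as the expanded `bsz_8/` example and that the root is named after the TOML file stem; `expected_dirs` is a hypothetical helper.

```python
from itertools import product
from pathlib import PurePosixPath

TASK_FAMILIES = ("vanilla_tasks", "our_tasks", "ablation_tasks")


def expected_dirs(toml_path: str, batch_sizes: list[int],
                  micro_batch_sizes: list[int], max_memory: int) -> list[str]:
    """List the relative task directories implied by the layout above."""
    stem = PurePosixPath(toml_path).stem  # e.g. "70b_pp3"
    return [
        str(PurePosixPath(stem, f"bsz_{bsz}", f"mbs_{mbs}", f"max_{max_memory}", family))
        for bsz, mbs, family in product(batch_sizes, micro_batch_sizes, TASK_FAMILIES)
    ]


for path in expected_dirs(
    "./torchtitan/models/llama3/train_configs/70b_pp3.toml", [8, 16, 32], [1], 80
):
    print(path)
```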


## Step 3: Run Experiments

### Multi-Node Training
```bash
# On master node (NODE_RANK=0)
NNODES=3 NODE_RANK=0 MASTER_ADDR="10.42.101.136" \
CONFIG_FILE="./path/to/generated/task.toml" \
./run_multinode.sh

# On worker nodes (NODE_RANK=1,2,...)
NNODES=3 NODE_RANK=1 MASTER_ADDR="10.42.101.136" \
CONFIG_FILE="./path/to/generated/task.toml" \
./run_multinode.sh

NNODES=3 NODE_RANK=2 MASTER_ADDR="10.42.101.136" \
CONFIG_FILE="./path/to/generated/task.toml" \
./run_multinode.sh
```
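
The commands above differ only in `NODE_RANK`. When launching many tasks, the per-node command lines can be generated instead of edited by hand. This is a minimal sketch: `launch_commands` is a hypothetical helper that only builds the strings shown above, and it does not start any process or connect to any node.

```python
import shlex


def launch_commands(nnodes: int, master_addr: str, config_file: str) -> list[str]:
    """Build one run_multinode.sh command line per node rank."""
    commands = []
    for rank in range(nnodes):
        commands.append(
            f"NNODES={nnodes} NODE_RANK={rank} "
            f"MASTER_ADDR={shlex.quote(master_addr)} "
            f"CONFIG_FILE={shlex.quote(config_file)} ./run_multinode.sh"
        )
    return commands


for cmd in launch_commands(3, "10.42.101.136", "./path/to/generated/task.toml"):
    print(cmd)
```

Each printed line is meant to be run on the node with the matching rank, with rank 0 on the master node.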
