# SpareTrain: An LLM Training Framework to Achieve Complete DMR Protection

A PyTorch-based framework for enhancing Large Language Model reliability through Double Modular Redundancy (DMR).

## Table of Contents
- [Overview](#overview)
- [Features](#features)
- [Installation](#installation)
- [Quick Start](#quick-start)
- [Configuration](#configuration)
- [Usage](#usage)
- [Contributing](#contributing)

## Overview

SpareTrain extends PyTorch and TorchTitan to provide comprehensive reliability mechanisms for large-scale language model training. It implements DMR with three phases of error detection and recovery, along with advanced memory management and communication optimization.

## Features

### Core Reliability Features
- **Double Modular Redundancy (DMR) Planning**: Three-phase error detection system
  - Phase 1: Planning for Piggyback-DMR (P-DMR)
  - Phase 2: Coarse-Grained Planning for Deferred-DMR (D-DMR)
  - Phase 3: Fine-Grained Planning for Deferred-DMR (D-DMR)
- **Silent Data Corruption (SDC) Detection**: Automatic error detection and re-execution

## Installation

### Prerequisites
- Python 3.10+
- CUDA 12.6+
- HBM-equipped NVIDIA GPUs (required for symmetric memory)
- NVIDIA Hopper or newer architecture (required for grouped GEMM in MoE models)
- InfiniBand network (recommended for multi-node training)
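
The version and architecture requirements above can be checked with a short script before starting a long build. The sketch below is illustrative only: `check_prerequisites` and its arguments are hypothetical names, not part of SpareTrain, and it assumes that "Hopper or newer" corresponds to CUDA compute capability 9.0 or higher. It does not check for HBM or InfiniBand.

```python
from typing import List, Optional, Sequence, Tuple


def check_prerequisites(
    python_version: Sequence[int],
    cuda_version: Optional[str],
    compute_capability: Optional[Tuple[int, int]],
    is_moe: bool = False,
) -> List[str]:
    """Return a list of unmet requirements; an empty list means all checks passed."""
    problems = []

    if tuple(python_version[:2]) < (3, 10):
        problems.append("Python 3.10 or newer is required")

    if cuda_version is None:
        problems.append("no CUDA version reported")
    else:
        # Compare only the major.minor part, e.g. "12.6" -> (12, 6).
        major, minor = (int(part) for part in cuda_version.split(".")[:2])
        if (major, minor) < (12, 6):
            problems.append("CUDA 12.6 or newer is required")

    # Assumption: "Hopper or newer" means compute capability >= 9.0.
    # The prerequisites list requires this only for MoE models (grouped GEMM).
    if is_moe and (compute_capability is None or compute_capability[0] < 9):
        problems.append("MoE models need a Hopper or newer GPU")

    return problems
```

Once a PyTorch build is importable, the inputs can be read from `sys.version_info`, `torch.version.cuda`, and `torch.cuda.get_device_capability()`.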

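### Build from Source

Run each step below from the same workspace directory. The `git apply` commands in step 2 expect `pytorch.diff` and `torchtitan.diff` to sit in that directory, next to the `pytorch/` and `torchtitan/` checkouts.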

#### 1. Clone Repositories

```bash
# Clone PyTorch and pin the commit the patch was written against
git clone https://github.com/pytorch/pytorch.git
cd pytorch
git checkout 6f23f53599629a47d6e097b2a027048658a142d4
cd ..

# Clone TorchTitan next to (not inside) the PyTorch checkout
git clone https://github.com/pytorch/torchtitan.git
cd torchtitan
git checkout be15836a89c6c346c65a9c435427027f950b90d6
cd ..
```

#### 2. Apply Modifications

```bash
# Apply PyTorch modifications
cd pytorch
git apply ../pytorch.diff

# Apply TorchTitan modifications
cd ../torchtitan
git apply ../torchtitan.diff
```

#### 3. Build PyTorch

```bash
cd pytorch

# Initialize submodules
git submodule sync
git submodule update --init --recursive

# Install dependencies
conda install -y cmake ninja rust
pip install -r requirements.txt
conda install -y -c pytorch magma-cuda124  # pick the magma-cuda* package that matches your CUDA version

# Build
make triton
export CMAKE_PREFIX_PATH="${CONDA_PREFIX:-"$(dirname $(which conda))/../"}:${CMAKE_PREFIX_PATH}"
python setup.py develop
```

#### 4. Setup TorchTitan

```bash
cd torchtitan

# Install dependencies
pip install -r requirements.txt

# Download tokenizers (optional)
huggingface-cli login
python scripts/download_tokenizer.py --repo_id meta-llama/Meta-Llama-3.1-8B
python scripts/download_tokenizer.py --repo_id mistralai/Mistral-Large-Instruct-2411
python scripts/download_tokenizer.py --repo_id meta-llama/Llama-4-Scout-17B-16E
```

## Quick Start

### Basic Configuration

Create a TOML configuration file with reliability settings:

```toml
[model]
name = "llama3"
flavor = "8B"

[optimizer]
name = "AdamW"
lr = 3e-4

[training]
batch_size = 8
steps = 1000
compile_memory_budget = 0.5

[reliability]
use_dmr = true
phase_1 = true
phase_2 = true
phase_3 = true
max_memory = 94  # GB, optional (defaults to GPU memory)

[parallelism]
tensor_parallel_degree = 2
pipeline_parallel_degree = 1
```

### Single-Node Training

```bash
cd torchtitan
CONFIG_FILE={your_config.toml} ./run_train.sh
```

### Multi-Node Training

```bash
# Master node (NODE_RANK=0)
NNODES=2 NODE_RANK=0 MASTER_ADDR="192.168.1.100" CONFIG_FILE={your_config.toml} ./run_multinode.sh

# Worker node (NODE_RANK=1)
NNODES=2 NODE_RANK=1 MASTER_ADDR="192.168.1.100" CONFIG_FILE={your_config.toml} ./run_multinode.sh
```

## Configuration

### Reliability Settings

The `[reliability]` section controls DMR behavior:

```toml
[reliability]
use_dmr = true                    # Enable/disable DMR system
phase_1 = true                    # Enable P-DMR (Piggyback DMR)
phase_2 = true                    # Enable D-DMR Inter (Deferred DMR Inter-node)
phase_3 = true                    # Enable D-DMR Intra (Deferred DMR Intra-node)
is_moe = false                    # Set true for MoE models
max_memory = 94                   # Memory limit in GB (optional)
```

**Phase Requirements**: DMR phases must be enabled in order. A phase can be enabled only if every lower-numbered phase is also enabled, so phases cannot be skipped.

Valid configurations:
- `phase_1=true, phase_2=false, phase_3=false` ✓
- `phase_1=true, phase_2=true, phase_3=false` ✓
- `phase_1=true, phase_2=true, phase_3=true` ✓
- `phase_1=true, phase_2=false, phase_3=true` ✗ (Invalid)
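
The ordering rule can be expressed as a small check. The following is a minimal sketch, not SpareTrain's own validation code: `validate_phases` is a hypothetical helper, and it assumes that disabling all three phases is acceptable, which the list above does not state.

```python
def validate_phases(phase_1: bool, phase_2: bool, phase_3: bool) -> bool:
    """Return True when no enabled phase follows a disabled one."""
    flags = [phase_1, phase_2, phase_3]
    # For each adjacent pair, the later phase may be on only if the earlier one is on.
    # Assumption: the all-disabled case passes, since it skips nothing.
    return all(earlier or not later for earlier, later in zip(flags, flags[1:]))


print(validate_phases(True, True, False))   # phases 1 and 2 only
print(validate_phases(True, False, True))   # phase 2 skipped
```

Running such a check before launching a job catches an invalid `[reliability]` section without waiting for the training process to start.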

### Memory Management

**Memory Limitation**: The `max_memory` parameter controls D-DMR memory usage in GB. If not specified, it defaults to the current GPU's available memory.

**Usage Examples**:
- To limit a 141 GB H200 GPU to 94 GB: Set `max_memory = 94`
- For default behavior: Omit `max_memory` (uses full GPU memory)
- For a custom limit: Set any value in GB (e.g., `max_memory = 80`)
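
The default behavior can be pictured as follows. This is a hedged sketch rather than the framework's implementation: `resolve_max_memory_gb` is a hypothetical helper, it assumes decimal gigabytes (the framework may count GiB), and it treats the default as the GPU's total memory.

```python
from typing import Optional

BYTES_PER_GB = 10**9  # assumption: decimal GB; the framework may use GiB instead


def resolve_max_memory_gb(configured_gb: Optional[float], gpu_total_bytes: int) -> float:
    """Return the memory limit in GB, falling back to the GPU's total memory."""
    if configured_gb is None:
        # max_memory omitted: use the whole device.
        return gpu_total_bytes / BYTES_PER_GB
    if configured_gb <= 0:
        raise ValueError("max_memory must be a positive number of GB")
    return float(configured_gb)


print(resolve_max_memory_gb(94, 141 * 10**9))    # explicit limit
print(resolve_max_memory_gb(None, 141 * 10**9))  # default
```

On a real machine the byte count could come from `torch.cuda.get_device_properties(0).total_memory`.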

### Parallelism Configuration

```toml
[parallelism]
tensor_parallel_degree = 4        # Tensor parallelism
pipeline_parallel_degree = 2      # Pipeline parallelism
expert_parallel_degree = 2        # Expert parallelism (for MoE)
```
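
When choosing these values, it helps to check how many GPUs they imply. The sketch below assumes the usual TorchTitan-style layout in which the total GPU count is the product of the data, tensor, and pipeline parallel degrees. `data_parallel_degree` is a hypothetical helper. It ignores context parallelism and leaves out `expert_parallel_degree`, because how expert parallelism composes with the other dimensions is not described here.

```python
def data_parallel_degree(
    world_size: int,
    tensor_parallel_degree: int,
    pipeline_parallel_degree: int,
) -> int:
    """Return the data-parallel degree left over after TP and PP are assigned."""
    model_parallel = tensor_parallel_degree * pipeline_parallel_degree
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world size {world_size} is not a multiple of TP x PP = {model_parallel}"
        )
    return world_size // model_parallel


# With the values above (TP=4, PP=2), 16 GPUs leave a data-parallel degree of 2.
print(data_parallel_degree(16, 4, 2))
```

With the settings shown above, at least 8 GPUs are needed, and any larger GPU count must be a multiple of 8.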