# Sparsity And Variance - Anonymous Submission

This repository contains code for training and evaluating language models with a focus on sparsity and variance analysis.

## Installation

### 1. Set up Python environment

Create a Python 3.9+ environment:

```bash
# Create virtual environment
python3.9 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```

### 2. Install PyTorch 2.6.0

```bash
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0
```


### 3. Install FlashAttention (optional, recommended for faster training)

```bash
pip install flash-attn --no-build-isolation
```

### 4. Install the project and dependencies

```bash
# Install the project in development mode
pip install -e .

# Install additional dependencies
pip install matplotlib datasets scikit-learn torchmetrics wandb
```
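
After installation, a quick check can confirm that the interpreter version and the packages listed above are visible in the active environment. The following is a minimal sketch, not part of the repository: the function name `check_environment` is hypothetical, and the import names (`sklearn` for scikit-learn, `flash_attn` for flash-attn) are the usual ones for those packages.

```python
import importlib.util
import sys

# Import names of the dependencies installed in the steps above.
REQUIRED = ["torch", "matplotlib", "datasets", "sklearn", "torchmetrics", "wandb"]
# flash-attn is optional, so it is reported separately.
OPTIONAL = ["flash_attn"]


def check_environment():
    """Report the Python version check and any packages that cannot be found."""

    def installed(name):
        # find_spec returns None when a top-level package is not installed.
        return importlib.util.find_spec(name) is not None

    return {
        "python_ok": sys.version_info >= (3, 9),
        "missing_required": [n for n in REQUIRED if not installed(n)],
        "missing_optional": [n for n in OPTIONAL if not installed(n)],
    }


if __name__ == "__main__":
    print(check_environment())
```

An empty `missing_required` list together with `python_ok: True` indicates that the steps above completed in the current environment.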

## Training

### Using train_script.sh

The main training script is `scripts/train_script.sh`, a wrapper that launches training from a named configuration. Its arguments are positional and must be given in the order shown below.

**Usage:**

```bash
bash scripts/train_script.sh <config> <batch_size> <global_train_batch_size> <learning_rate> <visible_gpus> <master_port> <run_suffix> <tokenizer_name> <dataset_name>
```

**Parameters:**

- `config`: Configuration file name without the `.yaml` extension (e.g., `olmoe-1B-7B`, `adamw-1B`)
- `batch_size`: Per-device microbatch size
- `global_train_batch_size`: Global batch size across all GPUs
- `learning_rate`: Learning rate (e.g., `4e-4`)
- `visible_gpus`: Comma-separated list of GPU IDs (e.g., `0,1,2,3`)
- `master_port`: Master port for distributed training (default: `29500`)
- `run_suffix`: Suffix for the run name (default: `none`)
- `tokenizer_name`: Tokenizer to use (`gptneox` or `qwen`)
- `dataset_name`: Dataset to use (`c4` or `fineweb`)
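
The relationship between `batch_size`, `global_train_batch_size`, and the number of GPUs can be sketched as follows. This assumes the usual data-parallel convention in which each optimizer step accumulates gradients over several microbatches per GPU; the helper name is hypothetical and the repository may compute this differently.

```python
def gradient_accumulation_steps(global_batch_size, microbatch_size, num_gpus):
    """Number of microbatches each GPU processes per optimizer step.

    Assumes global_batch_size = microbatch_size * num_gpus * accumulation_steps.
    """
    samples_per_pass = microbatch_size * num_gpus
    if global_batch_size % samples_per_pass != 0:
        raise ValueError(
            f"global batch size {global_batch_size} is not a multiple of "
            f"microbatch_size * num_gpus = {samples_per_pass}"
        )
    return global_batch_size // samples_per_pass


# visible_gpus is a comma-separated string, so the GPU count is its length.
num_gpus = len("0,1,2,3".split(","))
print(gradient_accumulation_steps(512, 4, num_gpus))  # prints 32
```

Under this assumption, a microbatch size of 4 on 4 GPUs with a global batch size of 512 implies 32 accumulation steps per optimizer update.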

**Examples:**

```bash
# Train a 1B model with GPTNeoX tokenizer on C4 dataset
bash scripts/train_script.sh adamw-1B 4 512 4e-4 0,1,2,3 29500 exp1 gptneox c4

# Train a MoE model with Qwen tokenizer on FineWeb
bash scripts/train_script.sh olmoe-1B-7B 4 1024 4e-4 0,1,2,3 29500 exp2 qwen fineweb
```
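
Because the script takes nine positional arguments, it is easy to pass them in the wrong order. A small launcher that accepts keyword arguments and emits them in the documented order can help. This is a sketch only: `build_train_command` is a hypothetical helper, not part of the repository, and the defaults for `master_port` and `run_suffix` are the ones listed above.

```python
import shlex


def build_train_command(
    *,
    config,
    batch_size,
    global_train_batch_size,
    learning_rate,
    visible_gpus,
    tokenizer_name,
    dataset_name,
    master_port=29500,
    run_suffix="none",
):
    """Return the argv list for scripts/train_script.sh in positional order."""
    positional = [
        config,
        batch_size,
        global_train_batch_size,
        learning_rate,
        visible_gpus,
        master_port,
        run_suffix,
        tokenizer_name,
        dataset_name,
    ]
    return ["bash", "scripts/train_script.sh", *map(str, positional)]


cmd = build_train_command(
    config="adamw-1B",
    batch_size=4,
    global_train_batch_size=512,
    learning_rate="4e-4",
    visible_gpus="0,1,2,3",
    run_suffix="exp1",
    tokenizer_name="gptneox",
    dataset_name="c4",
)
# Print the shell command; pass `cmd` to subprocess.run(cmd, check=True) to launch it.
print(shlex.join(cmd))
```

The learning rate is passed as a string so that `4e-4` reaches the script exactly as written rather than as `0.0004`.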

**Configuration Files:**

Available configuration files in `configs/`:

- `adamw-1B.yaml` - 1B dense model with AdamW optimizer
- `adamw-400M-momentum.yaml` - 400M dense model
- `olmoe-1B-7B.yaml` - MoE model with 1B active and 7B total parameters
- `olmoe-1B-7B-ablation.yaml` - The same 1B/7B MoE model, configured for the ablation study
- `olmoe-400M-2B.yaml` - MoE model with 400M active and 2B total parameters

### Customizing Training

You can modify the configuration files in `configs/` to adjust hyperparameters such as:

- Model architecture (d_model, n_layers, n_heads, etc.)
- MoE configuration (num_experts, top_k, etc.)
- Optimizer settings (learning_rate, weight_decay, etc.)
- Training parameters (max_duration, batch sizes, etc.)
- Data paths

### Direct Training with torchrun

For more control, launch `scripts/train.py` directly with `torchrun` and pass overrides as `--key=value` arguments after the config path:

```bash
torchrun --nproc-per-node=4 --master-port=29500 scripts/train.py \
    configs/olmoe-1B-7B.yaml \
    --run_name=my_experiment \
    --max_duration=2ep \
    --global_train_batch_size=1024 \
    --device_train_microbatch_size=4
```
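
In the command above, the `--key=value` arguments follow the config path and replace the corresponding values from the YAML file. The sketch below illustrates that override pattern on a plain nested dictionary. It is an illustration under assumptions, not the repository's parser: `apply_overrides` is a hypothetical name, values are kept as strings, and dotted keys such as `model.d_model` are assumed to address nested sections.

```python
def apply_overrides(config, overrides):
    """Apply `--dotted.key=value` strings to a nested dict, in place."""
    for item in overrides:
        key, sep, value = item.lstrip("-").partition("=")
        if not sep:
            raise ValueError(f"expected --key=value, got {item!r}")
        *parents, leaf = key.split(".")
        node = config
        for name in parents:
            # Create intermediate sections that do not exist yet.
            node = node.setdefault(name, {})
        node[leaf] = value
    return config


# Hypothetical config contents, for illustration only.
config = {"max_duration": "1ep", "model": {"d_model": "2048"}}
apply_overrides(
    config,
    ["--run_name=my_experiment", "--max_duration=2ep", "--model.d_model=1024"],
)
print(config)
```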

## Converting OLMo Checkpoints to HuggingFace Format

After training, you can convert OLMo checkpoints to HuggingFace format for easier evaluation and use.

### Convert Dense Models

For dense (non-MoE) models:

```bash
python scripts/convert_tools/convert_olmo_hf.py \
    --input_dir /path/to/olmo/checkpoint \
    --output_dir /path/to/hf/output \
    --tokenizer_json_path /path/to/tokenizer.json \
    --safe_serialization True
```

**Parameters:**

- `--input_dir`: Path to OLMo checkpoint directory containing `config.yaml` and `model.pt`
- `--output_dir`: Path to save the converted HuggingFace model
- `--tokenizer_json_path`: (Optional) Path to tokenizer JSON file
- `--safe_serialization`: Whether to use safetensors format (default: True)

### Convert MoE Models

For Mixture of Experts models:

```bash
python scripts/convert_tools/convert_olmo_moe_hf.py \
    --input_dir /path/to/olmo/moe/checkpoint \
    --output_dir /path/to/hf/moe/output \
    --tokenizer_json_path /path/to/tokenizer.json \
    --safe_serialization True
```

**Additional MoE-specific parameters:**

- `--force_unshard`: Force unsharding of distributed checkpoints

**Note:** The conversion script will automatically detect if the checkpoint is a distributed checkpoint (with `.distcp` files) and unshard it before conversion.
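
To check ahead of time whether a checkpoint will go through the unsharding path, you can look for `.distcp` files yourself. The following is a minimal sketch that searches the checkpoint directory recursively; the function name is hypothetical and the conversion script's own detection logic may differ.

```python
from pathlib import Path


def is_distributed_checkpoint(checkpoint_dir):
    """Return True if any file under `checkpoint_dir` has a `.distcp` suffix."""
    # rglob yields matching paths lazily; any() stops at the first match.
    return any(Path(checkpoint_dir).rglob("*.distcp"))
```

For example, `is_distributed_checkpoint("/path/to/olmo/moe/checkpoint")` returns `True` when the directory or any of its subdirectories contains a `.distcp` file.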

### Conversion Output

The conversion process will create:

- `config.json` - HuggingFace model configuration
- `pytorch_model.bin` or `model.safetensors` - Model weights
- `tokenizer.json` - Tokenizer file
- `pytorch_model.bin.index.json` - Weight index, present only when the weights are split across multiple shard files
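
Before running an evaluation, it can be useful to confirm that the output directory contains the files listed above. The sketch below checks for `config.json` and at least one form of the model weights. The helper name `missing_hf_files` is hypothetical, and `tokenizer.json` is not treated as required because `--tokenizer_json_path` is optional.

```python
from pathlib import Path

# Any one of these indicates that model weights were written.
WEIGHT_FILES = ["pytorch_model.bin", "model.safetensors", "pytorch_model.bin.index.json"]


def missing_hf_files(output_dir):
    """Return a list describing expected conversion outputs that are absent."""
    out = Path(output_dir)
    missing = []
    if not (out / "config.json").is_file():
        missing.append("config.json")
    if not any((out / name).is_file() for name in WEIGHT_FILES):
        missing.append("model weights")
    return missing
```

An empty list means both the configuration and some form of the weights were found in the directory.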

## Evaluation with lm-eval-harness

Once you have converted your model to HuggingFace format, you can evaluate it with EleutherAI's `lm-evaluation-harness`, distributed as the `lm-eval` package.

### Install lm-eval-harness

```bash
pip install "lm-eval[api]"
```

### Basic Evaluation

```bash
lm_eval --model hf \
    --model_args pretrained=/path/to/hf/model,trust_remote_code=True \
    --tasks hellaswag,arc_challenge \
    --batch_size 8
```

**Parameters:**

- `--model`: Model type (`hf` for HuggingFace)
- `--model_args`: Model arguments (pretrained path, trust_remote_code, etc.)
- `--tasks`: Comma-separated list of tasks
- `--batch_size`: Batch size for evaluation
- `--num_fewshot`: Number of few-shot examples (default: 0)
- `--device`: Device to use (`cuda` or `cpu`)
- `--output_path`: Path to save evaluation results
- `--log_samples`: Whether to log individual samples

### Example Evaluation Script

```bash
#!/bin/bash
MODEL_PATH="/path/to/hf/model"
OUTPUT_DIR="./eval_results"

# Run a 5-shot evaluation on three benchmarks and save per-sample logs
lm_eval --model hf \
    --model_args "pretrained=${MODEL_PATH},trust_remote_code=True" \
    --tasks hellaswag,piqa,arc_challenge \
    --batch_size 8 \
    --num_fewshot 5 \
    --device cuda \
    --output_path "${OUTPUT_DIR}" \
    --log_samples
```
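
The script above writes its results under `OUTPUT_DIR`. The sketch below shows one way to pull the numeric metrics out of a results file. It assumes the file is JSON with a top-level `results` object mapping each task name to its metrics; the exact file name and directory layout under `--output_path` depend on the installed `lm-eval` version, and `summarize_results` is a hypothetical helper.

```python
import json
from pathlib import Path


def summarize_results(results_file):
    """Return {task: {metric: value}} keeping only numeric metric values."""
    data = json.loads(Path(results_file).read_text())
    summary = {}
    for task, metrics in data.get("results", {}).items():
        summary[task] = {
            name: value
            for name, value in metrics.items()
            # Skip non-numeric entries such as task aliases; bool is excluded
            # explicitly because it is a subclass of int.
            if isinstance(value, (int, float)) and not isinstance(value, bool)
        }
    return summary
```

Pass the path of a results file found under `OUTPUT_DIR` to get a compact per-task dictionary that is easy to print or compare across runs.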

## Project Structure

```
.
├── configs/                    # Training configuration files
│   ├── adamw-1B.yaml
│   ├── adamw-400M-momentum.yaml
│   ├── olmoe-1B-7B.yaml
│   ├── olmoe-1B-7B-ablation.yaml
│   └── olmoe-400M-2B.yaml
├── olmo/                       # Core OLMo implementation
│   ├── config.py
│   ├── model.py
│   ├── moe.py
│   ├── train.py
│   └── ...
├── scripts/                    # Training and utility scripts
│   ├── train.py
│   ├── train_script.sh
│   ├── train_torchrun.py
│   └── convert_tools/
│       ├── convert_olmo_hf.py
│       └── convert_olmo_moe_hf.py
├── pyproject.toml              # Project dependencies
└── README.md                   # This file
```