# Cluster Execution Guide

This guide explains how to run hyperparameter tuning experiments on a SLURM cluster. The cluster scripts reuse functionality extracted from the hyperparameter tuning dashboard.

## Overview

The cluster execution system consists of three main components:

1. **Configuration Generator** (`generate_cluster_config.py`): Creates a JSON file listing all dataset × model combinations
2. **Experiment Runner** (`run_cluster_experiment.py`): Runs a single experiment based on config file and run ID
3. **SLURM Submission Script** (`submit_cluster_jobs.sh`): Submits all jobs to the cluster

## Workflow

### Step 1: Generate Configuration File

Create a JSON configuration file that lists all experiments to run:

```bash
python cluster_scripts/generate_cluster_config.py \
    --datasets '{"type": "openml", "dataset_id": 1, "name": "dataset1"}' \
              '{"type": "openml", "dataset_id": 2, "name": "dataset2"}' \
    --models MPFRegressor XGBRegressor LGBMRegressor \
    --output cluster_config.json \
    --n-trials 100 \
    --cv-strategy simple \
    --random-seed 42
```

**With custom hyperparameters:**

```bash
python cluster_scripts/generate_cluster_config.py \
    --datasets 1 2 3 \
    --models MPFRegressor XGBRegressor \
    --hyperparams custom_hyperparams.json \
    --output cluster_config.json
```

**Alternative: Using dataset IDs directly**

```bash
python cluster_scripts/generate_cluster_config.py \
    --datasets 1 2 3 4 5 \
    --models MPFRegressor XGBRegressor \
    --output cluster_config.json
```

**Using dataset config files:**

```bash
python cluster_scripts/generate_cluster_config.py \
    --datasets dataset1.json dataset2.json \
    --models MPFRegressor XGBRegressor \
    --output cluster_config.json
```

**Using OpenML benchmark suites:**

```bash
# All tasks from a suite
python cluster_scripts/generate_cluster_config.py \
    --suite 269 \
    --models MPFRegressor XGBRegressor \
    --output cluster_config.json

# The 10 smallest datasets (by n*p) from the suite
python cluster_scripts/generate_cluster_config.py \
    --suite "353[1-10]" \
    --models MPFRegressor XGBRegressor \
    --output cluster_config.json

# Datasets 5-15 from suite (sorted by n*p)
python cluster_scripts/generate_cluster_config.py \
    --suite "269[5-15]" \
    --models MPFRegressor \
    --output cluster_config.json
```

Suite tasks are **automatically sorted by n×p (ascending)**, where n=rows and p=features.
Use indexing `[start-end]` to select a subset (1-based, inclusive).
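
The indexing rule can be illustrated with a small sketch. It is not the code used by `generate_cluster_config.py`; the helper names `parse_suite_spec` and `select_tasks`, the `n_rows` and `n_features` keys, and the toy task metadata are all assumptions made for illustration.

```python
import re


def parse_suite_spec(spec):
    """Split a suite spec such as "353[1-10]" into (suite_id, start, end).

    start and end are 1-based and inclusive; both are None when the spec
    has no index range (e.g. "269").
    """
    match = re.fullmatch(r"(\d+)(?:\[(\d+)-(\d+)\])?", spec.strip())
    if match is None:
        raise ValueError(f"Invalid suite spec: {spec!r}")
    suite_id = int(match.group(1))
    if match.group(2) is None:
        return suite_id, None, None
    start, end = int(match.group(2)), int(match.group(3))
    if start < 1 or end < start:
        raise ValueError(f"Invalid index range in suite spec: {spec!r}")
    return suite_id, start, end


def select_tasks(tasks, start=None, end=None):
    """Sort tasks by n_rows * n_features (ascending), then apply the range."""
    ordered = sorted(tasks, key=lambda t: t["n_rows"] * t["n_features"])
    if start is None:
        return ordered
    # Convert the 1-based inclusive range to a 0-based Python slice.
    return ordered[start - 1:end]


# Toy task metadata (hypothetical values, not real OpenML tasks)
tasks = [
    {"name": "large", "n_rows": 10000, "n_features": 50},
    {"name": "small", "n_rows": 100, "n_features": 5},
    {"name": "medium", "n_rows": 1000, "n_features": 20},
]
suite_id, start, end = parse_suite_spec("353[1-2]")
selected = [t["name"] for t in select_tasks(tasks, start, end)]
print(suite_id, selected)  # 353 ['small', 'medium']
```

Because the range is applied after sorting, `[1-2]` always means "the two smallest tasks", regardless of the order in which the suite lists them.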

Popular suites:
- **99**: OpenML-CC18 (classification)
- **269**: AutoML Benchmark regression suite
- **271**: AutoML Benchmark classification suite
- **353**: OpenML-CTR23 regression suite

**Combining datasets and suite:**

```bash
python cluster_scripts/generate_cluster_config.py \
    --datasets 1 2 3 \
    --suite "269[1-5]" \
    --models MPFRegressor XGBRegressor \
    --output cluster_config.json
```

### Step 2: Review Configuration

The generated JSON file will have this structure. **Runs are automatically sorted by n×p (ascending)**, where n=rows and p=features, ensuring smallest datasets run first:

```json
{
  "metadata": {
    "total_runs": 6,
    "n_datasets": 2,
    "n_models": 3,
    "description": "Cluster execution configuration for hyperparameter tuning"
  },
  "global_config": {
    "optimization": {
      "method": "optuna",
      "n_trials": 100,
      "random_seed": 42
    },
    "cv": {
      "strategy": "simple",
      "train_split": 0.8,
      "simple_cv_folds": 3
    },
    "resources": {
      "n_jobs": 1,
      "random_seed": 42
    }
  },
  "runs": [
    {
      "dataset": {"type": "openml", "dataset_id": 1, "name": "dataset1"},
      "model": "MPFRegressor",
      "optimization": {...},
      "cv": {...},
      "resources": {...}
    },
    ...
  ]
}
```

Run IDs are zero-based indices into the `runs` array, so the ordering above determines which dataset and model each `--run-id` refers to.
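
The relationship between the `runs` array, the `metadata` block, and a run ID can be sketched as follows. This is a minimal illustration rather than the generator's actual code: `build_config` and `select_run` are hypothetical names, and the sketch leaves out the sorting step.

```python
import itertools


def build_config(datasets, models, global_config):
    """Build one run entry per dataset x model combination."""
    runs = [
        {"dataset": dataset, "model": model, **global_config}
        for dataset, model in itertools.product(datasets, models)
    ]
    return {
        "metadata": {
            "total_runs": len(runs),
            "n_datasets": len(datasets),
            "n_models": len(models),
        },
        "global_config": global_config,
        "runs": runs,
    }


def select_run(config, run_id):
    """Return the run entry for a zero-based run ID."""
    runs = config["runs"]
    if not 0 <= run_id < len(runs):
        raise IndexError(f"run_id {run_id} outside 0..{len(runs) - 1}")
    return runs[run_id]


datasets = [
    {"type": "openml", "dataset_id": 1, "name": "dataset1"},
    {"type": "openml", "dataset_id": 2, "name": "dataset2"},
]
models = ["MPFRegressor", "XGBRegressor", "LGBMRegressor"]
global_config = {
    "optimization": {"method": "optuna", "n_trials": 100, "random_seed": 42},
    "cv": {"strategy": "simple", "train_split": 0.8, "simple_cv_folds": 3},
    "resources": {"n_jobs": 1, "random_seed": 42},
}

config = build_config(datasets, models, global_config)
run = select_run(config, 4)
print(config["metadata"]["total_runs"], run["dataset"]["name"], run["model"])
# 6 dataset2 XGBRegressor
```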

### Step 3: Submit Jobs to Cluster

#### Option A: Using the submission script

The script `submit_cluster_jobs.sh` uses environment variables for cluster-specific paths (none are hardcoded in the repo for anonymous review). Set these before running:

- `SINGULARITY_IMAGE` — path to your Singularity container image (e.g. `path/to/slurm-notebook.sif`)
- `SLURM_PARTITION` — partition name (e.g. `shared` or your cluster’s short partition)
- `CONDA_ACTIVATE_SCRIPT` — script to source to get conda (e.g. `source /path/to/miniforge3/bin/activate`)
- `CONDA_ENV_PATH` — path to the conda env with MPF and dependencies (e.g. `/path/to/envs/py313`)

Then:

```bash
bash cluster_scripts/submit_cluster_jobs.sh cluster_config.json cluster_results/
```

This will submit one SLURM job for each run in the configuration file.

#### Option B: Manual submission

Submit one job per run ID (0 to `total_runs - 1`), changing `--run-id`, the job name, and the log file names each time. For run ID 0:

```bash
sbatch --job-name=mpf_exp_0 \
       --time=24:00:00 \
       --mem=16G \
       --cpus-per-task=4 \
       --output=cluster_results/slurm_%j_0.out \
       --error=cluster_results/slurm_%j_0.err \
       --wrap="python3 cluster_scripts/run_cluster_experiment.py --config cluster_config.json --run-id 0 --output cluster_results/"
```

#### Option C: Array job (more efficient)

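Create a SLURM array job script. Set `--array` to `0-(total_runs - 1)`; the example below assumes 30 runs. Create the `cluster_results/` directory before submitting, because SLURM does not create missing directories for `--output` and `--error`: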

```bash
#!/bin/bash
#SBATCH --job-name=mpf_experiments
#SBATCH --array=0-29
#SBATCH --time=24:00:00
#SBATCH --mem=16G
#SBATCH --cpus-per-task=4
#SBATCH --output=cluster_results/slurm_%A_%a.out
#SBATCH --error=cluster_results/slurm_%A_%a.err

python3 cluster_scripts/run_cluster_experiment.py \
    --config cluster_config.json \
    --run-id $SLURM_ARRAY_TASK_ID \
    --output cluster_results/
```
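
To avoid hardcoding the array range, it can be derived from the configuration file. The sketch below assumes the `metadata.total_runs` field shown in Step 2; the helper names and the script name `array_job.sh` are hypothetical, and the command is only built, not executed.

```python
import json
import tempfile
from pathlib import Path


def array_range(config_path):
    """Return the SLURM --array value covering every run in the config."""
    config = json.loads(Path(config_path).read_text())
    total_runs = config["metadata"]["total_runs"]
    if total_runs < 1:
        raise ValueError("Configuration contains no runs")
    return f"0-{total_runs - 1}"


def sbatch_command(config_path, script_path):
    """Build the sbatch argument list; the command is not executed here."""
    return ["sbatch", f"--array={array_range(config_path)}", str(script_path)]


# Demo with a throwaway config file holding only the metadata block
with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "cluster_config.json"
    config_path.write_text(json.dumps({"metadata": {"total_runs": 30}, "runs": []}))
    command = sbatch_command(config_path, "array_job.sh")

print(" ".join(command))  # sbatch --array=0-29 array_job.sh
```

Options given on the `sbatch` command line take precedence over `#SBATCH` directives in the script, so the `--array` line inside the script can stay as a default.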

### Step 4: Monitor Jobs

```bash
# Check job status
squeue -u $USER

# Check specific job
scontrol show job <job_id>

# Cancel a job
scancel <job_id>

# Cancel all your jobs
scancel -u $USER
```

### Step 5: Collect Results

Results are saved in the output directory as JSON files:

- `run_0000_dataset1_MPFRegressor.json` - Individual run results
- `run_0001_dataset1_XGBRegressor.json`
- etc.

Each result file contains:

```json
{
  "success": true,
  "dataset": "dataset1",
  "model": "MPFRegressor",
  "n_folds": 5,
  "mean_test_rmse": 0.1234,
  "std_test_rmse": 0.0056,
  "min_test_rmse": 0.1156,
  "max_test_rmse": 0.1345,
  "results": [...],
  "timestamp": "2024-01-01T12:00:00"
}
```
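
Result files in this format are straightforward to summarize. The following sketch is not `aggregate_cluster_results.py`; it only assumes the fields shown above, plus the assumption that failed runs are written with `"success": false`. The function names and the demo numbers are made up.

```python
import json
import tempfile
from pathlib import Path


def load_successful_results(results_dir):
    """Return (dataset, model, mean_test_rmse) for every successful run file."""
    rows = []
    for path in sorted(Path(results_dir).glob("run_*.json")):
        result = json.loads(path.read_text())
        if result.get("success"):
            rows.append((result["dataset"], result["model"], result["mean_test_rmse"]))
    return rows


def best_model_per_dataset(rows):
    """Map each dataset to the (model, rmse) pair with the lowest mean test RMSE."""
    best = {}
    for dataset, model, rmse in rows:
        if dataset not in best or rmse < best[dataset][1]:
            best[dataset] = (model, rmse)
    return best


# Demo with made-up numbers written to a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    examples = [
        ("run_0000_dataset1_MPFRegressor.json",
         {"success": True, "dataset": "dataset1", "model": "MPFRegressor", "mean_test_rmse": 0.12}),
        ("run_0001_dataset1_XGBRegressor.json",
         {"success": True, "dataset": "dataset1", "model": "XGBRegressor", "mean_test_rmse": 0.15}),
        ("run_0002_dataset1_LGBMRegressor.json",
         {"success": False, "dataset": "dataset1", "model": "LGBMRegressor"}),
    ]
    for name, payload in examples:
        (Path(tmp) / name).write_text(json.dumps(payload))
    rows = load_successful_results(tmp)

best = best_model_per_dataset(rows)
print(best)  # {'dataset1': ('MPFRegressor', 0.12)}
```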

## Configuration Options

### Dataset Types

The system supports multiple dataset types:

1. **OpenML Dataset**: `{"type": "openml", "dataset_id": 1}` or just `--datasets 1`
2. **OpenML Task**: `{"type": "openml_task", "task_id": 1}`
3. **OpenML Benchmark Suite**: `--suite 269` or `--suite "353[1-10]"` (sorted by n×p)
4. **Friedman Synthetic**: `{"type": "friedman", "n_samples": 1000, "n_features": 10}`
5. **Local NPY files**: `{"type": "local_npy", "data_path_x": "path/to/X.npy", "data_path_y": "path/to/y.npy"}`
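
The different ways of writing a `--datasets` value (bare ID, JSON string, or config file) could be normalized into one dictionary form roughly as shown below. This is a sketch under assumptions, not the parser in `generate_cluster_config.py`; `normalize_dataset_arg` is a hypothetical helper.

```python
import json
from pathlib import Path


def normalize_dataset_arg(arg):
    """Turn one --datasets value into a dataset dictionary with a "type" key."""
    arg = arg.strip()
    if arg.isdigit():
        # A bare integer is shorthand for an OpenML dataset ID.
        return {"type": "openml", "dataset_id": int(arg)}
    if arg.startswith("{"):
        spec = json.loads(arg)  # inline JSON string
    elif arg.endswith(".json"):
        spec = json.loads(Path(arg).read_text())  # dataset config file
    else:
        raise ValueError(f"Unrecognized dataset argument: {arg!r}")
    if "type" not in spec:
        raise ValueError(f"Dataset spec is missing 'type': {spec!r}")
    return spec


print(normalize_dataset_arg("1"))
# {'type': 'openml', 'dataset_id': 1}
print(normalize_dataset_arg('{"type": "friedman", "n_samples": 1000, "n_features": 10}'))
# {'type': 'friedman', 'n_samples': 1000, 'n_features': 10}
```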

### Cross-Validation Strategies

**Simple CV** (default):
- Single train/test split
- Cross-validation for hyperparameter optimization
- Options: `--cv-strategy simple --cv-train-split 0.8 --cv-simple-folds 3`

**Nested CV**:
- Outer CV for evaluation, inner CV for hyperparameter optimization
- Options: `--cv-strategy nested --cv-outer-folds 5 --cv-inner-folds 3`
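
The structure of nested CV can be made concrete with plain index arithmetic. The sketch below uses contiguous, unshuffled folds for readability, which is an assumption; the experiment runner may shuffle or stratify. The function names are hypothetical.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_idx, test_idx) lists for contiguous, unshuffled folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [
        n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
        for i in range(n_folds)
    ]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def nested_cv_splits(n_samples, outer_folds=5, inner_folds=3):
    """Yield (outer_train, outer_test, inner_splits) for each outer fold.

    The inner splits are drawn only from outer_train, so the outer test
    fold is never seen during hyperparameter optimization.
    """
    for outer_train, outer_test in k_fold_indices(n_samples, outer_folds):
        inner_splits = [
            ([outer_train[i] for i in train], [outer_train[i] for i in valid])
            for train, valid in k_fold_indices(len(outer_train), inner_folds)
        ]
        yield outer_train, outer_test, inner_splits


splits = list(nested_cv_splits(30, outer_folds=5, inner_folds=3))
print(len(splits), len(splits[0][2]))  # 5 3
```

With 5 outer and 3 inner folds, each trial's hyperparameters are fitted 3 times per outer fold, so nested CV costs roughly 5 times as much as a single inner CV loop.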

### Optimization Methods

- `optuna` (default): Tree-structured Parzen Estimator
- `random`: Random search
- `grid`: Grid search

### Models

Available models:
- `MPFRegressor`
- `XGBRegressor`
- `LGBMRegressor`
- `RandomForestRegressor`

### Hyperparameter Ranges

The system includes two standard hyperparameter configuration files:

- **`cluster_scripts/hyperparams/blackbox.json`**: Hyperparameters for blackbox models (XGBRegressor, LGBMRegressor, RandomForestRegressor, ExplainableBoostingRegressor, MPFRegressor)
- **`cluster_scripts/hyperparams/interpretable.json`**: Hyperparameters for the same models with interpretability-focused settings (e.g., `max_depth` limited to 2 for tree-based models)

**Using standard hyperparameter files:**

```bash
# Use interpretable hyperparameters
python cluster_scripts/generate_cluster_config.py \
    --datasets 1 2 3 \
    --models MPFRegressor XGBRegressor \
    --hyperparams cluster_scripts/hyperparams/interpretable.json \
    --output cluster_config.json

# Use blackbox hyperparameters
python cluster_scripts/generate_cluster_config.py \
    --datasets 1 2 3 \
    --models MPFRegressor XGBRegressor \
    --hyperparams cluster_scripts/hyperparams/blackbox.json \
    --output cluster_config.json
```

### Custom Hyperparameter Ranges

You can also specify custom hyperparameter ranges using a JSON file. The format is:

```json
{
  "ModelName": {
    "param_name": ["distribution_type", arg1, arg2, ...],
    "fixed_param": [value],
    "categorical_param": ["choice1", "choice2", "choice3"]
  }
}
```

**Distribution types:**
- `["randint", min, max]` - Integer range (exclusive max)
- `["uniform", min, max]` - Uniform continuous
- `["loguniform", min, max]` - Log-uniform continuous
- `[value]` - Fixed value
- `[val1, val2, ...]` - Categorical choices
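
One way to read these specifications is sketched below as a random-search sampler. It is an illustration of the format only, not the project's implementation (which may hand the ranges to Optuna instead); `sample_param` is a hypothetical helper.

```python
import math
import random


def sample_param(spec, rng):
    """Draw one value from a hyperparameter spec in the list format above."""
    if not isinstance(spec, list) or not spec:
        raise ValueError(f"Spec must be a non-empty list: {spec!r}")
    head = spec[0]
    if head == "randint" and len(spec) == 3:
        return rng.randrange(spec[1], spec[2])  # max is exclusive
    if head == "uniform" and len(spec) == 3:
        return rng.uniform(spec[1], spec[2])
    if head == "loguniform" and len(spec) == 3:
        # Sample uniformly in log space, then map back.
        return math.exp(rng.uniform(math.log(spec[1]), math.log(spec[2])))
    if len(spec) == 1:
        return spec[0]  # fixed value
    return rng.choice(spec)  # categorical choices


rng = random.Random(42)
space = {
    "epochs": ["randint", 1, 10],
    "n_trees": [200],
    "decay": ["uniform", 0.85, 0.99],
    "alpha": ["loguniform", 1e-5, 0.1],
    "refinement_strategy": ["l2", "huber"],
}
params = {name: sample_param(spec, rng) for name, spec in space.items()}
print(params)
```

Note the ambiguity this format implies: in this sketch, a three-element list whose first element is `"randint"`, `"uniform"`, or `"loguniform"` is always treated as a distribution, never as a categorical list.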

**Example hyperparameters file** (`custom_hyperparams.json`):

```json
{
  "MPFRegressor": {
    "epochs": ["randint", 1, 10],
    "n_trees": [200],
    "n_iter": ["randint", 50, 200],
    "decay": ["uniform", 0.85, 0.99],
    "alpha": ["loguniform", 1e-5, 0.1],
    "refinement_strategy": ["l2", "huber"]
  },
  "XGBRegressor": {
    "n_estimators": ["randint", 100, 1000],
    "learning_rate": ["uniform", 0.01, 0.3],
    "max_depth": ["randint", 3, 15]
  }
}
```

**Note:** The standard hyperparameter files (`blackbox.json` and `interpretable.json`) already include comprehensive configurations for all supported models. Use custom files only when you need to override specific parameter ranges.

**Usage:**

```bash
python cluster_scripts/generate_cluster_config.py \
    --datasets 1 2 3 \
    --models MPFRegressor XGBRegressor \
    --hyperparams custom_hyperparams.json \
    --output cluster_config.json
```

Or pass the ranges as a JSON string:

```bash
python cluster_scripts/generate_cluster_config.py \
    --datasets 1 2 \
    --models MPFRegressor \
    --hyperparams '{"MPFRegressor": {"epochs": ["randint", 5, 15]}}' \
    --output cluster_config.json
```

## Examples: Complete Workflows

### Example 1: Individual Datasets

```bash
# 1. Generate config for 10 datasets × 3 models = 30 runs
python cluster_scripts/generate_cluster_config.py \
    --datasets 1 2 3 4 5 6 7 8 9 10 \
    --models MPFRegressor XGBRegressor LGBMRegressor \
    --output cluster_config.json \
    --n-trials 100 \
    --cv-strategy simple

# 2. Submit all 30 jobs
bash cluster_scripts/submit_cluster_jobs.sh cluster_config.json cluster_results/

# 3. Monitor
watch -n 10 'squeue -u $USER'

# 4. After completion, aggregate results
python cluster_scripts/aggregate_cluster_results.py cluster_results/
```

### Example 2: OpenML Benchmark Suite (First 10 Smallest Datasets)

```bash
# 1. Generate config for first 10 datasets by n*p from suite 353
#    With 3 models = 30 runs
#    Using interpretable hyperparameters (max_depth=2 for trees)
python cluster_scripts/generate_cluster_config.py \
    --suite "353[1-10]" \
    --models MPFRegressor XGBRegressor LGBMRegressor \
    --hyperparams cluster_scripts/hyperparams/interpretable.json \
    --output cluster_config.json \
    --n-trials 200 \
    --cv-strategy nested \
    --cv-outer-folds 5 \
    --cv-inner-folds 3

# 2. Submit all jobs
bash cluster_scripts/submit_cluster_jobs.sh cluster_config.json suite353_results/

# 3. Monitor progress
squeue -u "$USER" -h | wc -l  # Count remaining jobs (-h omits the header line)

# 4. Aggregate results
python cluster_scripts/aggregate_cluster_results.py suite353_results/ --output suite353_summary.csv
```

### Example 3: Full Benchmark Suite

```bash
# Run on all datasets from regression suite 269
python cluster_scripts/generate_cluster_config.py \
    --suite 269 \
    --models MPFRegressor \
    --output cluster_config.json \
    --n-trials 100
```

### Example 4: Mixed Datasets and Suite

```bash
# Combine specific datasets with first 5 from a suite
python cluster_scripts/generate_cluster_config.py \
    --datasets 1 2 3 \
    --suite "269[1-5]" \
    --models MPFRegressor \
    --output cluster_config.json
```

## Troubleshooting

### Job fails immediately

- Check that the Python environment is activated
- Verify all dependencies are installed
- Check SLURM output/error files

### Out of memory errors

- Increase `--mem` in the SLURM script
- Reduce `--n-jobs` in the configuration

### Dataset loading fails

- Check internet connection (for OpenML datasets)
- Verify dataset IDs are valid
- Check OpenML API limits

### Model import errors

- Ensure `mpf-py` is installed and in Python path
- Check that all model dependencies (xgboost, lightgbm) are installed

## Advanced Usage

### Custom Model Parameters

To use custom model parameters, modify `run_cluster_experiment.py` to accept a model configuration file, or extend the configuration JSON format.

### Parallel Execution

Each job runs a single experiment. For parallel execution within a job (e.g., multiple CV folds), use `--n-jobs` in the configuration.

### Resume Failed Jobs

If a job fails, you can rerun it with the same run ID:

```bash
python cluster_scripts/run_cluster_experiment.py \
    --config cluster_config.json \
    --run-id 5 \
    --output cluster_results/
```

The script will overwrite the previous result file.
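
When many jobs have run, the run IDs that need resubmission can be found by scanning the output directory. The sketch below assumes the `run_<4-digit id>_<dataset>_<model>.json` naming shown in Step 5 and that failed runs either leave no file or write `"success": false`; `runs_to_retry` is a hypothetical helper.

```python
import json
import tempfile
from pathlib import Path


def runs_to_retry(total_runs, results_dir):
    """Return run IDs with no result file or without a successful result."""
    results_dir = Path(results_dir)
    retry = []
    for run_id in range(total_runs):
        succeeded = False
        for path in sorted(results_dir.glob(f"run_{run_id:04d}_*.json")):
            try:
                succeeded = bool(json.loads(path.read_text()).get("success"))
            except json.JSONDecodeError:
                succeeded = False  # truncated file, e.g. job killed mid-write
            if succeeded:
                break
        if not succeeded:
            retry.append(run_id)
    return retry


# Demo: run 0 succeeded, run 1 failed, run 2 never wrote a file
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "run_0000_dataset1_MPFRegressor.json").write_text(
        json.dumps({"success": True})
    )
    (Path(tmp) / "run_0001_dataset1_XGBRegressor.json").write_text(
        json.dumps({"success": False})
    )
    retry = runs_to_retry(3, tmp)

print(retry, "--array=" + ",".join(map(str, retry)))  # [1, 2] --array=1,2
```

SLURM accepts a comma-separated list of indices for `--array`, so the printed value can be used to resubmit only the missing runs as an array job.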

## Integration with Dashboard

Results from cluster execution can be imported back into the dashboard database if needed. The result JSON format is compatible with the dashboard's experiment structure.
