# FlowCast Instructions

Note: The experiments were run on a SLURM cluster, and the commands shown here are the ones we used.

## Setting up the Environment

You can set up the environment using either Docker (recommended for reproducibility) or Conda directly.

### Option 1: Using Docker (Recommended)

This method uses the provided `Dockerfile` to create a container with all necessary dependencies.

**1. Prerequisites:**

- [Docker](https://docs.docker.com/get-docker/)
- [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html) for GPU support.

**2. Build the Docker image:**
From the root of the repository, run:

```bash
docker build -t flowcast-env .
```

**3. Run the Docker container:**
To start an interactive session inside the container with your local project directory mounted, run:

```bash
docker run --gpus all -it --ipc=host -v "$(pwd)":/app flowcast-env /bin/bash
```

You will be dropped into a shell inside the container, with the `nowcasting` conda environment ready to use. All the commands in the following sections should be run from within this container.

### Option 2: Using Conda

This method sets up the environment on your local machine.

**1. Prerequisites:**

- [Miniconda](https://docs.conda.io/en/latest/miniconda.html) or [Anaconda](https://www.anaconda.com/products/distribution).

**2. Create the Conda environment from the `environment.yml` file:**

NOTE: `environment.yml` was exported directly from the training environment. It may contain unused packages, which we kept to preserve exact reproducibility.

```bash
conda env create -f environment.yml
```

**3. Activate the environment:**

```bash
conda activate nowcasting
```

**4. Install PyTorch with GPU support:**
The `environment.yml` file may not install the correct GPU-enabled version of PyTorch. It is recommended to run the following command to ensure compatibility with CUDA 12.4:

```bash
conda install pytorch torchvision torchaudio pytorch-cuda=12.4 -c pytorch -c nvidia -y
```

Once the environment is set up, you can run the training and testing commands.
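
Before launching a long training run, it can help to confirm that the GPUs are visible to PyTorch, whichever setup option you used. The helper below is a minimal sketch and not part of the repository: `check_gpu` is a hypothetical name, and it assumes `nvidia-smi` and `python` are on the `PATH` of the active `nowcasting` environment.

```bash
# Hypothetical helper (not part of the repository): verify GPU visibility.
check_gpu() {
  # nvidia-smi ships with the NVIDIA driver; if it is missing, GPU access is not set up.
  if ! command -v nvidia-smi >/dev/null 2>&1; then
    echo "nvidia-smi not found: the GPU driver or container toolkit may be missing" >&2
    return 1
  fi
  # List the GPUs the driver can see.
  nvidia-smi -L
  # Ask PyTorch whether CUDA is usable and how many devices it detects.
  python -c "import torch; print('CUDA available:', torch.cuda.is_available(), '| devices:', torch.cuda.device_count())"
}

check_gpu || echo "GPU check failed; see the message above"
```

If the Conda setup reports `CUDA available: False`, rerunning the PyTorch installation step above is a reasonable first thing to try.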

## Reproducing Experiments on a SLURM Cluster with Enroot

The original experiments were conducted on a SLURM-managed HPC cluster using Enroot to run containers for a consistent and reproducible environment. The instructions below outline how to replicate this setup.

### 1. Container Setup with Docker and Enroot

This workflow involves building a Docker image, pushing it to a public registry like Docker Hub, and then importing it into the SLURM cluster's Enroot storage.

**a. Build and Push the Docker Image:**
First, build the Docker image locally and push it to your Docker Hub account.

```bash
# Log in to Docker Hub (enter your credentials when prompted)
docker login

# Build the image, replacing <your-dockerhub-username> with your Docker Hub username
docker build -t <your-dockerhub-username>/flowcast-env:latest .

# Push the image to Docker Hub
docker push <your-dockerhub-username>/flowcast-env:latest
```

**b. Import the Container on the SLURM Cluster:**
Next, run a one-time command on your SLURM cluster to pull the image from Docker Hub and convert it into an Enroot squashfs (`.sqfs`) file. This file will be reused for all subsequent jobs.

```bash
srun --partition=<your-partition> --mem=8G --container-image=docker://<your-dockerhub-username>/flowcast-env:latest --container-save=/path/on/shared/storage/flowcast_latest.sqfs bash -c "echo 'Container pulled and saved.'"
```

Replace `<your-partition>`, `<your-dockerhub-username>`, and `/path/on/shared/storage/` with the appropriate values for your environment.

### 2. Creating a SLURM Submission Script

Below is a template SLURM script to run a distributed training job using the Enroot container. You will need to adapt the paths and resource requests (`#SBATCH` directives) for your specific cluster configuration.

Create a file named `run_slurm.sh`:

```bash
#!/bin/bash
#SBATCH --job-name=train-flowcast
#SBATCH --partition=your_partition      # Specify the partition (queue)
#SBATCH --gres=gpu:4                    # Request 4 GPUs
#SBATCH --nodes=1                       # Number of nodes
#SBATCH --ntasks-per-node=1             # One task per node
#SBATCH --cpus-per-task=64              # CPUs per task
#SBATCH --mem=144G                      # Total memory for the job
#SBATCH --time=120:00:00                # Time limit
#SBATCH --output=slurm_logs/dist_train_%j.log

# --- Environment Setup ---
# Set your W&B API key if you are using Weights & Biases
# export WANDB_API_KEY=YOUR_API_KEY

# --- Paths ---
# Path to the Enroot container image on your cluster
CONTAINER_IMAGE=/path/on/shared/storage/flowcast_latest.sqfs
# Path to your project directory on the cluster
PROJECT_DIR=$(pwd)

# --- Script to run ---
# This variable holds the script and arguments that torchrun will execute.
# Modify this block to run different training or testing scripts.
# Example for training SEVIR FlowCast:
SCRIPT_WITH_ARGS="experiments/sevir/runner/flowcast/dist_train_flowcast.py \
--config experiments/sevir/runner/flowcast/flowcast_config.yaml \
--train_file datasets/sevir/data/sevir_full_latent_vae_kl1e4/nowcast_training_full.h5 \
--train_meta datasets/sevir/data/sevir_full_latent_vae_kl1e4/nowcast_training_full_META.csv \
--val_file datasets/sevir/data/sevir_full_latent_vae_kl1e4/nowcast_validation_full.h5 \
--val_meta datasets/sevir/data/sevir_full_latent_vae_kl1e4/nowcast_validation_full_META.csv \
--partial_evaluation_file datasets/sevir/data/sevir_full/nowcast_validation_full.h5 \
--partial_evaluation_meta datasets/sevir/data/sevir_full/nowcast_validation_full_META.csv"

# --- torchrun Rendezvous Setup ---
export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
export MASTER_PORT=$((10000 + ($SLURM_JOB_ID % 50000)))

echo "----------------------------------------------------"
echo "SLURM JOB ID: $SLURM_JOB_ID"
echo "Nodes Allocated: $SLURM_JOB_NODELIST"
echo "Master Node Addr: $MASTER_ADDR, Port: $MASTER_PORT"
echo "Num Nodes: $SLURM_NNODES, GPUs per Node: $SLURM_GPUS_ON_NODE"
echo "----------------------------------------------------"

# --- Execute the Job ---
# SCRIPT_WITH_ARGS is not exported, so it is expanded here by the submission shell;
# the variables escaped as \$ are expanded by the shell inside the container.
srun \
  --container-image="${CONTAINER_IMAGE}" \
  --container-mounts="${PROJECT_DIR}":/app \
  --container-workdir=/app \
  bash -c "
    export PYTHONUNBUFFERED=1;
    source activate nowcasting && \
    torchrun --nnodes=\$SLURM_NNODES \
             --nproc_per_node=\$SLURM_GPUS_ON_NODE \
             --rdzv_id=\$SLURM_JOB_ID \
             --rdzv_backend=c10d \
             --rdzv_endpoint=\$MASTER_ADDR:\$MASTER_PORT \
             ${SCRIPT_WITH_ARGS}"

echo "--- Job $SLURM_JOB_ID finished ---"
```
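
The `MASTER_PORT` line in the template derives the rendezvous port from the job ID, so jobs with different IDs usually land on different ports in the range 10000 to 59999. As a small illustration of that arithmetic (`port_for_job` is a hypothetical helper name, and the job IDs are made up):

```bash
# Same formula as MASTER_PORT in run_slurm.sh: 10000 + (job_id % 50000).
port_for_job() {
  echo $((10000 + ($1 % 50000)))
}

port_for_job 123456   # prints 33456
port_for_job 49999    # prints 59999 (upper end of the range)
port_for_job 50000    # prints 10000 (the modulo wraps around)
```

Two jobs whose IDs differ by a multiple of 50000 would map to the same port, which only matters if they happen to share a master node at the same time.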

### 3. Submitting the Job

To run a training job, modify the `SCRIPT_WITH_ARGS` variable in `run_slurm.sh` with the command you want to execute (you can copy the commands from the sections below), and then submit it to the SLURM scheduler:

```bash
sbatch run_slurm.sh
```
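
SLURM does not create the directory named in `--output`, so the `slurm_logs/` folder used by the template should exist before submission; otherwise the job may fail without writing a log. A possible submission sequence, sketched under the assumption that `run_slurm.sh` sits in the current directory and that `sbatch` and `squeue` are available on the login node:

```bash
# Create the log directory referenced by "#SBATCH --output=slurm_logs/dist_train_%j.log".
mkdir -p slurm_logs

if command -v sbatch >/dev/null 2>&1; then
  # --parsable makes sbatch print only the job ID (plus ";cluster" on multi-cluster setups).
  job_id=$(sbatch --parsable run_slurm.sh)
  job_id=${job_id%%;*}
  echo "Submitted job ${job_id}"
  # Show the job's current state in the queue.
  squeue -j "${job_id}"
  echo "Follow the log with: tail -f slurm_logs/dist_train_${job_id}.log"
else
  echo "sbatch not found: run this on a SLURM login node"
fi
```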

## Training and Testing FlowCast

This section lists the commands for training and testing FlowCast on the SEVIR and ARSO datasets. Run them inside the Docker container. On a SLURM cluster, adapt `run_slurm.sh` instead: set `SCRIPT_WITH_ARGS` for `torchrun` jobs, or replace the `torchrun` call in the `srun` block for plain `python` commands.

### SEVIR

#### 0. Preprocessing the SEVIR Dataset

Download the VIL data from the SEVIR Dataset:

```bash
aws s3 cp --no-sign-request s3://sevir/CATALOG.csv CATALOG.csv
aws s3 sync --no-sign-request s3://sevir/data/vil data/vil
```

Run the preprocessing script:

```bash
python datasets/sevir/sevir_preprocessing.py --catalog_csv_path CATALOG.csv --data_dir data/vil --val_cutoff "2019-01-01 00:00:00" --test_cutoff "2019-06-01 00:00:00" --output_dir sevir_full --img_type vil --keep_dtype True --downsample_factor 1
```
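
The two cutoff timestamps presumably define a chronological train/validation/test split. The sketch below illustrates that reading only; it is an assumption, not the logic of `sevir_preprocessing.py`, and the handling of events that fall exactly on a cutoff has not been checked against the script. `split_for` is a hypothetical helper name.

```bash
# Illustration only: assumes events before --val_cutoff are training data,
# events from --val_cutoff up to (not including) --test_cutoff are validation,
# and everything from --test_cutoff onward is test data.
VAL_CUTOFF="2019-01-01 00:00:00"
TEST_CUTOFF="2019-06-01 00:00:00"

split_for() {
  # Timestamps in "YYYY-MM-DD HH:MM:SS" form sort chronologically as plain strings.
  if [[ "$1" < "$VAL_CUTOFF" ]]; then
    echo train
  elif [[ "$1" < "$TEST_CUTOFF" ]]; then
    echo val
  else
    echo test
  fi
}

split_for "2018-07-04 12:00:00"   # prints train
split_for "2019-03-15 08:30:00"   # prints val
split_for "2019-06-01 00:00:00"   # prints test
```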

#### 1. Training the Autoencoder

To train the autoencoder, run the following command, adjusting the file paths and the number of nodes and processes per node as needed.
We recommend using 1 node with 4 H100 GPUs (Approximate time: 62 hours for 250 epochs - Global Batch Size 12).

```bash
torchrun --nnodes=1 --nproc_per_node=4 \
experiments/sevir/autoencoder/dist_train_autoencoder_kl.py \
--config experiments/sevir/autoencoder/autoencoder_kl_config.yaml \
--train_file datasets/sevir/data/sevir_full/nowcast_training_full.h5 \
--train_meta datasets/sevir/data/sevir_full/nowcast_training_full_META.csv \
--val_file datasets/sevir/data/sevir_full/nowcast_validation_full.h5 \
--val_meta datasets/sevir/data/sevir_full/nowcast_validation_full_META.csv
```
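
The quoted global batch size of 12 is spread across all processes. Assuming the usual data-parallel convention (global batch = per-process batch × number of processes, which has not been checked against the training script or config), changing `--nnodes` or `--nproc_per_node` changes the per-process batch needed to keep the same global value. A small sketch of that arithmetic, with `per_process_batch` as a hypothetical helper:

```bash
# Hypothetical helper: per-process batch size for a target global batch size.
# Usage: per_process_batch GLOBAL_BATCH NNODES NPROC_PER_NODE
per_process_batch() {
  local global=$1 world=$(( $2 * $3 ))
  if (( global % world != 0 )); then
    echo "global batch ${global} is not divisible by ${world} processes" >&2
    return 1
  fi
  echo $(( global / world ))
}

per_process_batch 12 1 4   # prints 3 (the recommended 1 node x 4 GPUs)
per_process_batch 12 1 2   # prints 6
```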

#### 2. Generating Latent Dataset

With the trained autoencoder, generate the latent dataset for training and validation of FlowCast.

NOTE: Replace the `--preload_model` path below with the actual path to the `early_stopping_model.pt` file generated in the previous step.

```bash
python experiments/sevir/autoencoder/generate_static_dataset.py \
    --num_workers 12 \
    --preload_model saved_models/sevir/autoencoder/models/early_stopping_model.pt \
    --normalize_dataset \
    --train_file datasets/sevir/data/sevir_full/nowcast_training_full.h5 \
    --train_meta datasets/sevir/data/sevir_full/nowcast_training_full_META.csv \
    --val_file datasets/sevir/data/sevir_full/nowcast_validation_full.h5 \
    --val_meta datasets/sevir/data/sevir_full/nowcast_validation_full_META.csv \
    --out_dir datasets/sevir/data/sevir_latent_vae
```
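
The generation step depends on the autoencoder checkpoint and the preprocessed HDF5 files, so a quick pre-flight check before launching it can save a failed run. The following is a sketch only: `require_files` is a hypothetical helper, and the paths are the example paths from the command above, which you should replace with your own.

```bash
# Hypothetical helper: report each missing input file and return non-zero if any is absent.
require_files() {
  local missing=0 f
  for f in "$@"; do
    if [[ ! -f "$f" ]]; then
      echo "missing: $f" >&2
      missing=1
    fi
  done
  return "$missing"
}

require_files \
  saved_models/sevir/autoencoder/models/early_stopping_model.pt \
  datasets/sevir/data/sevir_full/nowcast_training_full.h5 \
  datasets/sevir/data/sevir_full/nowcast_training_full_META.csv \
  datasets/sevir/data/sevir_full/nowcast_validation_full.h5 \
  datasets/sevir/data/sevir_full/nowcast_validation_full_META.csv \
  && echo "all inputs found" \
  || echo "fix the paths listed above before running the generation script"
```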

#### 3. Training FlowCast

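With the latent dataset, train FlowCast. Set `--train_file`, `--train_meta`, `--val_file`, and `--val_meta` to the latent dataset files, and `--partial_evaluation_file` and `--partial_evaluation_meta` to the original validation dataset. The example paths assume the latent dataset lives in `datasets/sevir/data/sevir_full_latent_vae_kl1e4/`; if you kept the `--out_dir` from the previous step, point to that directory instead.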

```bash
torchrun --nnodes=1 --nproc_per_node=4 \
experiments/sevir/runner/flowcast/dist_train_flowcast.py \
--config experiments/sevir/runner/flowcast/flowcast_config.yaml \
--train_file datasets/sevir/data/sevir_full_latent_vae_kl1e4/nowcast_training_full.h5 \
--train_meta datasets/sevir/data/sevir_full_latent_vae_kl1e4/nowcast_training_full_META.csv \
--val_file datasets/sevir/data/sevir_full_latent_vae_kl1e4/nowcast_validation_full.h5 \
--val_meta datasets/sevir/data/sevir_full_latent_vae_kl1e4/nowcast_validation_full_META.csv \
--partial_evaluation_file datasets/sevir/data/sevir_full/nowcast_validation_full.h5 \
--partial_evaluation_meta datasets/sevir/data/sevir_full/nowcast_validation_full_META.csv
```

#### 4. Testing FlowCast

After training the FlowCast model, test it on the testing dataset.

NOTE: Replace the `--artifacts_folder` path below with the actual path to the artifacts directory generated during the FlowCast training step.

```bash
python experiments/sevir/runner/flowcast/test_flowcast.py \
    --artifacts_folder saved_models/sevir/flowcast \
    --config experiments/sevir/runner/flowcast/flowcast_config.yaml \
    --test_file datasets/sevir/data/sevir_full/nowcast_testing_full.h5 \
    --test_meta datasets/sevir/data/sevir_full/nowcast_testing_full_META.csv
```

#### 5. Training FlowCast (Diffusion Objective)

With the latent dataset, train FlowCast (Diffusion Objective). Set `--train_file`, `--train_meta`, `--val_file`, and `--val_meta` to the latent dataset files, and `--partial_evaluation_file` and `--partial_evaluation_meta` to the original validation dataset.
Adjust the number of nodes and processes per node as needed. We recommend using 1 node with 4 H100 GPUs (approximate time: 135 hours for 200 epochs with a global batch size of 12).

```bash
torchrun --nnodes=1 --nproc_per_node=4 \
experiments/sevir/runner/flowcast_diffusion/dist_train_flowcast_diffusion.py \
--config experiments/sevir/runner/flowcast_diffusion/flowcast_diffusion_config.yaml \
--train_file datasets/sevir/data/sevir_full_latent_vae_kl1e4/nowcast_training_full.h5 \
--train_meta datasets/sevir/data/sevir_full_latent_vae_kl1e4/nowcast_training_full_META.csv \
--val_file datasets/sevir/data/sevir_full_latent_vae_kl1e4/nowcast_validation_full.h5 \
--val_meta datasets/sevir/data/sevir_full_latent_vae_kl1e4/nowcast_validation_full_META.csv \
--partial_evaluation_file datasets/sevir/data/sevir_full/nowcast_validation_full.h5 \
--partial_evaluation_meta datasets/sevir/data/sevir_full/nowcast_validation_full_META.csv
```

#### 6. Testing FlowCast (Diffusion Objective)

After training the FlowCast (Diffusion Objective) model, test it on the testing dataset.

NOTE: Replace the `--artifacts_folder` path below with the actual path to the artifacts directory generated during the diffusion training step.

```bash
python experiments/sevir/runner/flowcast_diffusion/test_flowcast_diffusion.py \
    --artifacts_folder <path_to_your_artifacts_folder> \
    --config experiments/sevir/runner/flowcast_diffusion/flowcast_diffusion_config.yaml \
    --test_file datasets/sevir/data/sevir_full/nowcast_testing_full.h5 \
    --test_meta datasets/sevir/data/sevir_full/nowcast_testing_full_META.csv
```

### ARSO (Dataset not yet published)

#### 0. Downloading the ARSO Dataset

Not yet available/published.

#### 1. Training the Autoencoder

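To train the autoencoder, run the following command, adjusting the number of nodes and processes per node as needed. We recommend using 1 node with 4 H100 GPUs (approximate time: 120 hours for 250 epochs with a global batch size of 12). Note that the example command launches 3 processes; set `--nproc_per_node` to the number of GPUs you actually use.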

```bash
torchrun --nnodes=1 --nproc_per_node=3 \
experiments/arso/autoencoder/dist_train_autoencoder_kl.py \
--config experiments/arso/autoencoder/autoencoder_kl_config.yaml \
--data_file datasets/arso/final_sequence_data/sequence_data_ds1.h5
```

#### 2. Generating Latent Dataset

With the trained autoencoder, generate the latent dataset for training and validation of FlowCast.

NOTE: Replace the `--autoencoder_path` below with the actual path to the `early_stopping_model.pt` file generated in the autoencoder training step.

```bash
python experiments/arso/autoencoder/generate_static_dataset.py \
    --config experiments/arso/autoencoder/autoencoder_kl_config.yaml \
    --autoencoder_path saved_models/autoencoder/models/early_stopping_model.pt \
    --data_file datasets/arso/final_sequence_data/sequence_data_ds1.h5
```

#### 3. Training FlowCast

With the latent dataset, train FlowCast. Set `--data_file` to the original dataset and `--data_file_latent` to the latent dataset generated in the previous step.
Adjust the number of nodes and processes per node as needed. We recommend using 1 node with 4 H100 GPUs (approximate time: 120 hours for 200 epochs with a global batch size of 12).

```bash
torchrun --nnodes=1 --nproc_per_node=4 \
experiments/arso/runner/flowcast/dist_train_flowcast.py \
--config experiments/arso/runner/flowcast/flowcast_config.yaml \
--data_file datasets/arso/final_sequence_data/sequence_data_ds1.h5 \
--data_file_latent datasets/arso/final_sequence_data/sequence_data_ds1_latent.h5
```

#### 4. Testing FlowCast

After training the FlowCast model, test it on the testing dataset.

NOTE: Replace the `--artifacts_folder` path below with the actual path to the artifacts directory generated during the ARSO FlowCast training step.

```bash
python experiments/arso/runner/flowcast/test_flowcast.py \
    --artifacts_folder <path_to_your_artifacts_folder> \
    --config experiments/arso/runner/flowcast/flowcast_config.yaml \
    --data_file datasets/arso/final_sequence_data/sequence_data_ds1.h5
```
