# Quickstart Guide 
To run Swarm on daint, you first need to set up an environment and then launch the code as described below.

## Setting up the environment
First, you need an environment that has all the required packages. Most of them are provided by `module load cray-python`, but a few are not.

### mpi4py
The mpi4py package is available through the module system. However, that build does not work when you send tensors that live on the GPU. To solve this, you need to build and install mpi4py from source. The project files include a `create_mpi4py_module.sh` script. Create a directory where you want to install mpi4py and run the following:
```bash
bash create_mpi4py_module.sh <your_install_dir>
```
Once that is done, make the module available by running
```bash
module use <your_install_dir>/modulefiles
```
It makes sense to add this line to your `.bashrc` so that you don't have to run it in every session. After this, running
```bash
module load mpi4py
```
will load the custom mpi4py. 


As a sanity check, run the following:
```bash
salloc -N 2 -C gpu -A g34 --partition=normal
MPICH_RDMA_ENABLED_CUDA=1 srun python -c "from mpi4py import MPI"
```

It should run without any problems. 
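
A slightly more informative check than a bare import is to have every rank report itself. The sketch below is only an illustration and is not part of the repository: the file name `check_mpi.py` and the function `describe_mpi` are hypothetical. It relies only on `MPI.COMM_WORLD`, `Get_rank`, `Get_size`, and `MPI.Get_processor_name` from mpi4py, and it falls back to a message if the import fails.

```python
# check_mpi.py (hypothetical file name)
def describe_mpi():
    """Return a one-line description of this MPI rank, or an error message."""
    try:
        from mpi4py import MPI  # fails if the mpi4py module is not loaded
    except ImportError as exc:
        return f"mpi4py could not be imported: {exc}"
    comm = MPI.COMM_WORLD
    host = MPI.Get_processor_name()
    return f"rank {comm.Get_rank()} of {comm.Get_size()} on {host}"


if __name__ == "__main__":
    print(describe_mpi())
```

Launched with `MPICH_RDMA_ENABLED_CUDA=1 srun python check_mpi.py` inside the allocation, each rank should print one line. If some ranks report an import error, the custom module is probably not loaded on those nodes.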

### Virtual environment
Since we will be using some packages that aren't provided by Cray, we need to create a virtual environment. To do this, run:
```bash
module load cray-python
python -m venv --system-site-packages myvenv
```
You can then activate the environment using:
```bash
source ./myvenv/bin/activate
```
Afterwards, install the following packages:
```bash
pip install numpy Pillow==6.2 tensorboardX protobuf scipy six
```
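
To confirm that the virtual environment actually sees these packages, a small import check can help. The following is a minimal sketch, not part of the repository; `missing_modules` and `REQUIRED_MODULES` are hypothetical names. Note that the import names differ from the pip names for two packages: Pillow is imported as `PIL` and protobuf as `google.protobuf`.

```python
import importlib.util

# Import names for the pip packages listed above.
REQUIRED_MODULES = ["numpy", "PIL", "tensorboardX", "google.protobuf", "scipy", "six"]


def missing_modules(names):
    """Return the subset of `names` that cannot be located for import."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # Raised when a parent package (e.g. `google`) is absent
            # or a module has no usable spec.
            found = False
        if not found:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    print("missing:", ", ".join(missing) if missing else "none")
```

Run it with the venv activated. Any name it lists still needs to be installed with pip.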

### PyTorch
Previously, daint did not provide PyTorch in its module system. Now, to get access to PyTorch, all you have to do is run:
```bash
module load PyTorch
```
If you don't want to use the provided version, you can install it using pip like the previous packages:
```bash
pip install torch torchvision
```


## Running the code
There are two ways to run the scripts on daint. First, you can submit the CIFAR-10 script as a batch job using:
```bash
sbatch job_cifar10.sh
```

### Interactive Jobs
The other way is to use interactive jobs. To get an interactive session, enter this:
```bash
salloc -N <num-nodes> -C gpu -A <account> --partition=<partition-you-want>
```
After that, to use one-sided communication in MPI, you must export an environment variable like so:
```bash
export MPICH_RDMA_ENABLED_CUDA=1
```
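
Forgetting this export is easy in a fresh session, so a script can check for it before touching MPI. The sketch below is an assumption about how such a guard could look and is not taken from the repository; `cuda_rdma_enabled` and `require_cuda_rdma` are hypothetical helper names.

```python
import os


def cuda_rdma_enabled(environ=None):
    """Return True if MPICH_RDMA_ENABLED_CUDA is set to "1" in the mapping."""
    environ = os.environ if environ is None else environ
    return environ.get("MPICH_RDMA_ENABLED_CUDA", "0") == "1"


def require_cuda_rdma(environ=None):
    """Raise a clear error early instead of failing later inside MPI."""
    if not cuda_rdma_enabled(environ):
        raise RuntimeError(
            "MPICH_RDMA_ENABLED_CUDA is not set to 1; "
            "run `export MPICH_RDMA_ENABLED_CUDA=1` before calling srun"
        )


# Typical use at the top of a worker script (before importing mpi4py):
#     require_cuda_rdma()
```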
Afterwards, load your modules and activate your venv if you haven't already:
```bash
module load daint-gpu cray-python PyTorch
module load intel
source <addr-to-venv>/myvenv/bin/activate
```
Now you can launch the worker using `srun`:
```bash
srun python worker_temp.py --dataset-name cifar10 --average-epochs 75 --virtual-epoch-num 200 --batch-size 256 --lr 0.1 --log-interval 10 --weight-decay 0.0005 --local-updates 2 --quantize
```
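
When you run many variations of this command, it can be convenient to keep the options in one place and generate the command line from them. The following is a minimal sketch under that assumption; `build_srun_command` is a hypothetical helper, not something the repository provides. The flag names and values are the ones from the command above.

```python
import shlex


def build_srun_command(script, options):
    """Build the argv list for `srun python <script> ...` from a dict."""
    argv = ["srun", "python", script]
    for name, value in options.items():
        flag = "--" + name.replace("_", "-")
        if value is True:
            argv.append(flag)  # boolean switch such as --quantize
        elif value is False or value is None:
            continue  # switch left out
        else:
            argv.extend([flag, str(value)])
    return argv


options = {
    "dataset_name": "cifar10",
    "average_epochs": 75,
    "virtual_epoch_num": 200,
    "batch_size": 256,
    "lr": 0.1,
    "log_interval": 10,
    "weight_decay": 0.0005,
    "local_updates": 2,
    "quantize": True,
}

command = build_srun_command("worker_temp.py", options)
# Quote each argument so the line can be pasted into a shell.
print(" ".join(shlex.quote(arg) for arg in command))
```

The resulting list can also be passed directly to `subprocess.run(command, check=True)` from inside an allocation.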

### Things to be mindful of
1. Quantization
If you want to include quantization, pay attention to the `quant_bits` and `quant_s` parameters. If quantization is preventing the loss from going down, one possible cause is that the `quant_s` parameter is not properly configured.

2. `mpi4py.rc.threads`
Whenever you have multithreading or window-creation issues, try commenting or uncommenting the `mpi4py.rc.threads` line to see if that fixes them.
On a local server, commenting it out would make the code crash. On daint, uncommenting it would make the code crash.

3. Unexpected `SEGMENTATION_FAULT`
One of the downsides of running this code on daint is that an unexpected `SEGMENTATION_FAULT` can occur at any time. Some of the places where we have encountered it are as follows:

- Commenting/Uncommenting the `mpi4py.rc.threads` line. 
    
- If you use `win.Get` and select `MPI.BYTE` or any other 1-byte datatype.
    To solve this issue, always get a few more bytes so that the total number is a multiple of four, and use `MPI.FLOAT` or `MPI.INT` as the datatype.
    
- When creating a window using `MPI.Win.Create`.
    This error was more pronounced before the recent daint update. Back then, the code would sometimes throw a `SEGMENTATION_FAULT` while it was trying to create the window. We never found out what the issue was. Changing a couple of parameters, commenting or uncommenting lines of code that had nothing to do with window creation, or adding meaningless extra lines of code made the issue go away.
    If you see this happening to you, the only remedy we have found is to keep trying various changes until one of them works.
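
The padding workaround for `win.Get` mentioned above can be sketched as follows. This is an illustration under stated assumptions rather than the code used in the repository: `padded_nbytes` and `element_count` are hypothetical helpers, and the commented `win.Get` call assumes a window `win` and a `target_rank` that exist elsewhere and a target buffer that exposes at least the padded number of bytes.

```python
def padded_nbytes(nbytes, unit=4):
    """Round nbytes up to the next multiple of `unit` (4 for MPI.FLOAT/MPI.INT)."""
    if nbytes < 0:
        raise ValueError("nbytes must be non-negative")
    return -(-nbytes // unit) * unit  # ceiling division, then scale back


def element_count(nbytes, unit=4):
    """Number of 4-byte elements needed to cover nbytes."""
    return padded_nbytes(nbytes, unit) // unit


payload_size = 10  # bytes we actually care about
buffer = bytearray(padded_nbytes(payload_size))  # 12 bytes, a multiple of four

# With mpi4py, the read would then use a 4-byte datatype instead of MPI.BYTE
# (assumes `win` and `target_rank` are defined elsewhere):
#     win.Get([buffer, element_count(payload_size), MPI.FLOAT], target_rank)

# After the transfer, drop the padding again.
payload = bytes(buffer[:payload_size])
```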