The following instructions are for files in the `code/` subdirectory. Other subdirectories, such as `motivation/` (Section 1 experiments) and `implementation/` (end-to-end experiments), have their own READMEs. For the data used to generate our results, see the `data/` subdirectory; for scripts that launch the binaries (built with the commands below) to run experiments, see the `scripts/` subdirectory. To calculate the algorithmic bandwidth for each data point, divide the buffer size by the runtime (ignore the final column of the output). More code, data, and detailed instructions will be available upon paper acceptance.
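The algorithmic bandwidth calculation mentioned above can be sketched as follows. This is a minimal illustration, not part of the repository: the function name `algorithmic_bandwidth` is hypothetical, and it assumes the runtime has already been converted to seconds (the unit of the runtime column in the output is not stated here, so check it before converting).

```python
def algorithmic_bandwidth(num_bytes: int, runtime_s: float) -> float:
    """Return algorithmic bandwidth in bytes per second.

    num_bytes: buffer size in bytes (the NUM_BYTES argument of the run).
    runtime_s: measured runtime, assumed to be in seconds.
    """
    if runtime_s <= 0:
        raise ValueError("runtime must be positive")
    return num_bytes / runtime_s


# Example: a 1 GiB buffer reduced in 0.05 s.
bw = algorithmic_bandwidth(1 << 30, 0.05)
print(f"{bw / 1e9:.2f} GB/s")  # prints 21.47 GB/s
```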

To synthesize schedules for a power-of-2 number of GPUs (n):
- Run: `python synthesizer_pow2.py <n>` where n is the number of GPUs

To synthesize schedules for a non-power-of-2, even number of GPUs (n):
- Run: `python synthesizer_nonpow2.py <n>` where n is the number of GPUs
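The choice between the two synthesizer scripts above depends only on n, which can be sketched as below. The helper `pick_synthesizer` is hypothetical and not part of the repository; it only encodes the two cases listed above (power-of-2, and even but not a power of 2) and rejects anything else. Whether the scripts themselves accept every such n is not verified here.

```python
def pick_synthesizer(n: int) -> str:
    """Return the synthesizer script name matching the GPU count n."""
    if n < 2:
        raise ValueError("need at least 2 GPUs")
    # A power of 2 has exactly one bit set, so n & (n - 1) is zero.
    if n & (n - 1) == 0:
        return "synthesizer_pow2.py"
    if n % 2 == 0:
        return "synthesizer_nonpow2.py"
    # Odd GPU counts are not covered by the instructions above.
    raise ValueError("odd GPU counts are not covered by either script")


for n in (4, 6, 8, 12):
    print(f"python {pick_synthesizer(n)} {n}")
```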

To compile AllReduce for 4 GPUs:
- Compile AllReduce: `nvcc -diag-suppress=177 -I${NCCL_HOME}/include -L${NCCL_HOME}/lib -lnccl  -o allreduce allreduce_4GPU.cu -std=c++17`

To compile AllReduce for 8 GPUs:
- Compile AllReduce: `nvcc -diag-suppress=177 -I${NCCL_HOME}/include -L${NCCL_HOME}/lib -lnccl  -o allreduce allreduce_8GPU.cu -std=c++17`

To run AllReduce: 
- `./allreduce <NUM_BYTES> <ALG> <ITERS> <DELAY>` where NUM_BYTES is the buffer size in bytes, ALG is one of ['stragglar', 'ring', 'rhd', 'direct', 'allpairs'] (i.e., [StragglAR, Ring, Recursive Halving and Doubling, Direct, MSCCL]), ITERS is the number of iterations, and DELAY is the straggler delay time in ms (-1 ignores the delay and assumes the pre-work, either ReduceScatter for StragglAR or AllReduce for Direct, has already completed)
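
When sweeping over buffer sizes or algorithms, the command line above can be assembled programmatically. The following is a sketch only: `build_allreduce_cmd` is a hypothetical helper, the binary path `./allreduce` assumes the output name used in the compile commands, and the example argument values are arbitrary.

```python
# Algorithm names accepted by the ALG argument, as listed above.
ALGS = ("stragglar", "ring", "rhd", "direct", "allpairs")


def build_allreduce_cmd(num_bytes, alg, iters, delay_ms, binary="./allreduce"):
    """Build the argument list: <binary> <NUM_BYTES> <ALG> <ITERS> <DELAY>."""
    if alg not in ALGS:
        raise ValueError(f"unknown algorithm {alg!r}; expected one of {ALGS}")
    if num_bytes <= 0 or iters <= 0:
        raise ValueError("num_bytes and iters must be positive")
    return [binary, str(num_bytes), alg, str(iters), str(delay_ms)]


# Example values are illustrative: 1 GiB buffer, 10 iterations, no delay (-1).
cmd = build_allreduce_cmd(1 << 30, "stragglar", 10, -1)
print(" ".join(cmd))
# To actually launch the binary, one could pass the list to subprocess:
#   import subprocess
#   subprocess.run(cmd, check=True)
```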

To obtain the simulation results in the main text:
- Navigate to the `scripts/` subdirectory with: `cd scripts/`
- Run: `python simulation.py`