# Code and Traces of TeraIO

### Structure of the Repository

This supplemental material includes the following four directories:
- `pytorch-patch`: This directory contains TeraIO's customized PyTorch **patch**, which adds code support for tensor lifetime analysis and the tensor migration engine. The patch is based on PyTorch 2.5.0 (https://github.com/pytorch/pytorch, commit 584f674aa01f69b0d2cab09f78ea6e035e1b3285).
- `torchtitan`: This directory includes our Python scripts to train Llama3 and Granite models with TeraIO. We develop our code based on https://github.com/pytorch/torchtitan (commit ac90c36e39c6274f9beaf76922627665b6553905).
- `teraio-algorithm`: This directory contains the tensor migration algorithm.
- `megatron-ds`: This directory includes the scripts to run ZeRO-Offload and ZeRO-Infinity. It also contains the data traces we collected during our experiments.

### Reproducing the Experiments of TeraIO (on Linux)

0. Install CUDA, cuFile, and Python.

   To run this project, we need at least one CUDA-compatible GPU. All our experiments are done with 2 GPUs (see details in `torchtitan/test_runner.py::build_test_list()`). We also need to install CUDA (https://docs.nvidia.com/cuda/cuda-installation-guide-linux/) and cuFile (https://docs.nvidia.com/gpudirect-storage/troubleshooting-guide/index.html).
   Note that when installing cuFile, we must also update the NVMe drivers. Additionally, if the original NVMe drivers are included in initrd, we must update the initrd after installing cuFile and the new NVMe drivers.

   We use Python 3.12.9 in our experiments, and any version starting from 3.12 should work. We recommend using a new Conda environment to manage the TeraIO software stack.

   Apply the PyTorch patch in `pytorch-patch` to the PyTorch source code (commit 584f674aa01f69b0d2cab09f78ea6e035e1b3285) downloaded from GitHub.

1. Build and install the customized PyTorch for TeraIO.

   ```sh
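   # download the source code (skip the clone if it is already downloaded) and check out the commit our patch is based on
   git clone https://github.com/pytorch/pytorch
   cd pytorch
   git checkout 584f674aa01f69b0d2cab09f78ea6e035e1b3285

   # apply our patch from ../pytorch-patch to this source tree, then update submodules
   git submodule sync
   git submodule update --init --recursive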

   # install dependencies
   conda install cmake ninja
   pip install -r requirements.txt
   pip install mkl-static mkl-include
   conda install -c pytorch magma-cuda124  # or the magma-cuda* that matches your CUDA version from https://anaconda.org/pytorch/repo
   make triton

   # build and install
   export _GLIBCXX_USE_CXX11_ABI=1
   export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}
   python setup.py develop
   
   # back to the home directory of TeraIO
   cd ..
   ```

2. Install dependencies for TorchTitan.

   ```sh
   cd torchtitan
   pip install -r requirements.txt
   cd ..
   ```

3. Build RAIDs for tensor migration.

   The following example demonstrates how to **build a RAID-0** (`/dev/md/md${GPUNO}`) using four SSDs (`/dev/${NVME0}`, `/dev/${NVME1}`, `/dev/${NVME2}`, and `/dev/${NVME3}`), mount it at `/mnt/md${GPUNO}`, and then create a directory (`/mnt/md${GPUNO}/gds_tensor_files${GPUNO}`) for storing tensors offloaded to SSDs via cuFile.

   ```sh
   # build a RAID-0 with multiple SSDs
   # - PASSWD: the password of an account with sudo privileges
   # - GPUNO: the index of the GPU on the machine; for example, if there is only one GPU, this number should be 0
   # - NVME0..NVME3: the device names of the four SSDs under /dev
   echo "$PASSWD" | sudo -S bash -c "
     mdadm --create --run --force /dev/md/md${GPUNO} --level=0 --raid-devices=4 /dev/${NVME0} /dev/${NVME1} /dev/${NVME2} /dev/${NVME3}
     mkfs.xfs /dev/md/md${GPUNO} -f
     mkdir -p /mnt/md${GPUNO}  # the mount point must exist before mounting
     mount /dev/md/md${GPUNO} /mnt/md${GPUNO}
     chmod 777 /mnt/md${GPUNO}
     mkdir -p /mnt/md${GPUNO}/gds_tensor_files${GPUNO}
   "
   ```

   In practice, the numbers of SSDs and GPUs may differ from our setup. We can create a different number of RAID-0s (e.g., `/dev/md/md0`, `/dev/md/md1`, etc.) depending on the number of GPUs, and each RAID-0 may consist of a different number of SSDs.

   **Note**: To use TeraIO, the mount point of each RAID must follow the format `/mnt/md${GPUNO}` (e.g., `/mnt/md0`, `/mnt/md1`). Additionally, we should create a corresponding `gds_tensor_files${GPUNO}` directory in each mounted RAID (e.g., `/mnt/md0/gds_tensor_files0`, `/mnt/md1/gds_tensor_files1`).
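
   The naming convention above can be checked before training. The following is a minimal sketch and not part of TeraIO: the helper names `expected_tensor_dirs` and `missing_tensor_dirs` are hypothetical, and the `root` parameter exists only so that the check can be tried outside `/mnt`.

   ```python
   from pathlib import Path


   def expected_tensor_dirs(ngpu: int, root: str = "/mnt") -> list[Path]:
       """Return the per-GPU tensor directories required by the convention above."""
       return [Path(root) / f"md{rank}" / f"gds_tensor_files{rank}" for rank in range(ngpu)]


   def missing_tensor_dirs(ngpu: int, root: str = "/mnt") -> list[Path]:
       """Return the expected directories that do not exist yet."""
       return [d for d in expected_tensor_dirs(ngpu, root) if not d.is_dir()]


   if __name__ == "__main__":
       # Example for a 2-GPU machine; an empty list means every directory is in place.
       print(missing_tensor_dirs(ngpu=2))
   ```

   An empty result means that every RAID is mounted and has its `gds_tensor_files${GPUNO}` directory.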

4. Profile tensor lifetime information.

   ```sh
   # run the command in torchtitan
   cd torchtitan

   # command to profile models with limited GPU memory
   # - ngpu: the total number of available GPUs
   # - test: the name of the training configuration to run (we can find all configurations or add new ones in "test_runner.py")
   # - print_info: whether to print TeraIO's log for each kernel
   # - offloadingbw: the emulated bandwidth (GB/s) for tensor offloading; 0 means unlimited bandwidth, -1 means using real SSDs for offloading
   # - liveness_path: the path to store the tensor lifetime information
   rm -rf outputs/; python ./test_runner.py --ngpu ${TOTAL_NGPU} outputs --test ${TEST_CONF_NAME} --print_info False --offloadingbw 0 --liveness_path ../teraio-algorithm/data_${TEST_CONF_NAME}
   # an example could be
   # "rm -rf outputs/; python ./test_runner.py --ngpu 2 outputs --test 8b_64_1k_8ubs --print_info False --offloadingbw 0 --liveness_path ../teraio-algorithm/data_8b_64_1k_8ubs"

   # back to the home directory of TeraIO
   cd ..
   ```

   After running the above commands, the directory `teraio-algorithm/data_${TEST_CONF_NAME}/` will contain subdirectories named `0/`, `1/`, ..., depending on the number of GPUs available. Each subdirectory corresponds to a GPU (e.g., `0/` for GPU 0, `1/` for GPU 1, etc.) and contains a file named `semantics.in`, which stores all tensor lifetime information required by the tensor migration algorithm.
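
   Before moving on, it can help to confirm that profiling produced a `semantics.in` for every GPU rank. The sketch below assumes only the directory layout described above; the function name `ranks_missing_semantics` is hypothetical, and treating an empty file as missing is our own assumption.

   ```python
   from pathlib import Path


   def ranks_missing_semantics(data_dir: str, ngpu: int) -> list[int]:
       """Return the GPU ranks whose <data_dir>/<rank>/semantics.in is absent or empty."""
       missing = []
       for rank in range(ngpu):
           trace = Path(data_dir) / str(rank) / "semantics.in"
           # Assumption: an empty file indicates that profiling did not complete.
           if not trace.is_file() or trace.stat().st_size == 0:
               missing.append(rank)
       return missing


   if __name__ == "__main__":
       # Run from the home directory of this project after profiling with 2 GPUs.
       print(ranks_missing_semantics("teraio-algorithm/data_8b_64_1k_8ubs", ngpu=2))
   ```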

5. Generate tensor migration plans.

   Before generating the tensor migration plan from the lifetime information, the algorithm must be compiled.

   ```sh
   cd teraio-algorithm/src
   make -j
   cd ../.. 
   ```

   In the configuration file, we need to specify the bandwidth of the RAID, the bandwidth of the CPU memory, and the maximum CPU memory capacity that can be used.

   First, create the configuration file in the same directory as the profiled tensor lifetime data. The example below assumes a system with 2 GPUs, so one configuration file is created for each GPU:

   ```sh
   cd teraio-algorithm/data_${TEST_CONF_NAME}/0
   touch demo.config
   cd ../1
   touch demo.config

   # go back to the home directory of this project
   cd ../../..
   ```
   
   Next, fill in each configuration file using the template below. Several fields depend on the underlying hardware. Specifically, you need to replace:

   - `output_dir` with the actual output directory. For example, if the configuration is for Llama3-8B with a batch size of 64, sequence length of 1,024, microbatch number of 8, and GPU memory expanded only via SSDs, a valid output directory could be `torchtitan_0_8b_64_1k_8ubs_gpu80_ssdonly` or `torchtitan_1_8b_64_1k_8ubs_gpu80_ssdonly`, where "0" and "1" stand for the GPU ranks.
   - `speedup` with an integer (e.g., 10, 20, 40) that controls the speed of the migration planning algorithm.
   - `gpu_mem_size` with the available GPU memory (in GB). Please reserve at least 5GB for efficient cuFile operations and GPU memory fragmentation.
   - `ssd_bandwidth` with the RAID bandwidth for mixed read-write workloads. We can measure this using `gdsio` or `fio`, ensuring that the I/O size is larger than 1MB. Note that actual SSD bandwidth may be slightly lower when simultaneous migration to both SSD and CPU memory is enabled.
   - `cpu_mem_bandwidth` with the bandwidth between CPU and GPU. Again, ensure that the I/O size used for measurement is larger than 1MB.
   - `cpu_mem_size` with the maximum CPU memory (in GB) to be used for tensor migration.

   ```sh
   output_folder           ../../results/{output_dir}
   is_simulation           1


   stat_output_file        sim_result

   use_prefetch            1
   algo_speedup            {speedup}

   migration_policy        TERAIOGDSSSD


   GPU_memory_size_GB      {gpu_mem_size}
   GPU_frequency_GHz       1.2
   GPU_PCIe_bandwidth_GBps 64
   GPU_malloc_uspB         0.000000814
   GPU_free_uspB           0

   SSD_PCIe_bandwidth_GBps {ssd_bandwidth}
   SSD_read_latency_us     12
   SSD_write_latency_us    16
   SSD_latency_us          20


   CPU_PCIe_bandwidth_GBps {cpu_mem_bandwidth}
   CPU_memory_line_GB      {cpu_mem_size}


   PCIe_latency_us         5

   delta_parameter         0.5
   ```
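
   Since only the placeholders differ between GPU ranks, the per-rank files can be generated from one copy of the template. The following is a minimal sketch under our own assumptions: `render_config` is a hypothetical helper (not part of TeraIO), and the demo uses two lines of the template with an illustrative `speedup` value.

   ```python
   import re


   def render_config(template: str, values: dict[str, object]) -> str:
       """Fill {name} placeholders in the template text and reject any leftovers."""
       text = template
       for name, value in values.items():
           text = text.replace("{" + name + "}", str(value))
       leftover = re.findall(r"\{(\w+)\}", text)
       if leftover:
           raise ValueError(f"unfilled placeholders: {sorted(set(leftover))}")
       return text


   if __name__ == "__main__":
       # Two lines of the template above; the speedup value is illustrative only.
       template = "output_folder ../../results/{output_dir}\nalgo_speedup {speedup}\n"
       for rank in range(2):
           values = {
               "output_dir": f"torchtitan_{rank}_8b_64_1k_8ubs_gpu80_ssdonly",
               "speedup": 10,
           }
           print(render_config(template, values))
   ```

   Raising an error on unfilled placeholders avoids passing a literal `{ssd_bandwidth}` or similar to the planner by mistake.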

   Finally, run the following command and wait for the migration plan to be generated. To generate a plan for another GPU rank or a different model, simply run the same command under the corresponding directory.

   ```sh
   # change to the directory containing the tensor lifetime information (semantics.in)
   cd teraio-algorithm/data_${TEST_CONF_NAME}/0
   ../../src/teraio-algorithm demo.config

   # change to the home directory of this project
   cd ../../..
   ```
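
   With several GPU ranks, the same command has to be repeated in each rank's directory. The sketch below automates this under the paths used above; `planner_jobs` and `run_planner` are hypothetical helper names, and the script is assumed to run from the home directory of this project on Linux, where the relative executable path is resolved against each rank's working directory.

   ```python
   import subprocess
   from pathlib import Path


   def planner_jobs(data_dir: str, ngpu: int, config: str = "demo.config") -> list[tuple[Path, list[str]]]:
       """Return (working directory, command) pairs, one per GPU rank."""
       return [
           (Path(data_dir) / str(rank), ["../../src/teraio-algorithm", config])
           for rank in range(ngpu)
       ]


   def run_planner(data_dir: str, ngpu: int) -> None:
       """Run the planner for each rank in turn, stopping at the first failure."""
       for cwd, cmd in planner_jobs(data_dir, ngpu):
           subprocess.run(cmd, cwd=cwd, check=True)
   ```

   For example, `run_planner("teraio-algorithm/data_8b_64_1k_8ubs", ngpu=2)` would generate the plans for ranks 0 and 1 one after the other.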

6. Train models with generated tensor migration plans.

   We use expandable segments to mitigate GPU memory fragmentation caused by frequent tensor migrations. To enable this feature, please run the following command:

   ```sh
   export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
   ```
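
   If the environment already defines `PYTORCH_CUDA_ALLOC_CONF` with other options, the `export` above overwrites them. The sketch below shows one way to merge the setting from a custom Python launcher instead; `with_expandable_segments` is a hypothetical helper, and the variable has to be set before PyTorch makes its first CUDA allocation.

   ```python
   import os


   def with_expandable_segments(current: str | None) -> str:
       """Return an allocator config string that includes expandable_segments:True."""
       options = [opt.strip() for opt in (current or "").split(",") if opt.strip()]
       # Drop any earlier expandable_segments setting so that ours takes effect.
       options = [opt for opt in options if not opt.startswith("expandable_segments:")]
       options.append("expandable_segments:True")
       return ",".join(options)


   # Set this before the first CUDA allocation (simplest: before `import torch`).
   os.environ["PYTORCH_CUDA_ALLOC_CONF"] = with_expandable_segments(
       os.environ.get("PYTORCH_CUDA_ALLOC_CONF")
   )
   ```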

   Then, use the following command to run TeraIO and train LLMs under limited GPU memory.

   ```sh
   cd torchtitan

   # command to train models with limited GPU memory
   # - offloadingbw: the emulated bandwidth (in GB/s) used for tensor offloading; set it to 0 to emulate unlimited bandwidth, or -1 to use real SSDs/CPU memory for offloading
   # - plan_path: the path to the generated tensor migration plans; note that regardless of the number of GPUs, always use "X" in the command to represent the GPU rank
   rm -rf outputs/; python ./test_runner.py --ngpu ${TOTAL_NGPU} outputs --test ${TEST_CONF_NAME} --print_info False --offloadingbw -1 --plan_path ../teraio-algorithm/results/torchtitan_X_${TEST_CONF_NAME}_gpu80_ssdonly
   # one example can be
   # "rm -rf outputs/; python ./test_runner.py --ngpu 2 outputs --test 8b_64_1k_8ubs --print_info False --offloadingbw -1 --plan_path ../teraio-algorithm/results/torchtitan_X_8b_64_1k_8ubs_gpu80_ssdonly"

   cd ..
   ```
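
   Because `--plan_path` uses `X` in place of the GPU rank, a missing plan for one rank is easy to overlook. The sketch below expands the placeholder and lists the plan directories that do not exist yet. It assumes that `X` maps to the ranks `0`, `1`, ... as in the output directory names of step 5; `plan_dirs` and `missing_plan_dirs` are hypothetical helpers, not part of TeraIO.

   ```python
   from pathlib import Path


   def plan_dirs(plan_path: str, ngpu: int) -> list[Path]:
       """Expand the rank placeholder X into one plan directory per GPU rank."""
       marker = "torchtitan_X_"
       if marker not in plan_path:
           raise ValueError(f"expected {marker!r} in plan_path")
       return [Path(plan_path.replace(marker, f"torchtitan_{rank}_")) for rank in range(ngpu)]


   def missing_plan_dirs(plan_path: str, ngpu: int) -> list[Path]:
       """Return the per-rank plan directories that do not exist yet."""
       return [d for d in plan_dirs(plan_path, ngpu) if not d.is_dir()]
   ```

   An empty result from `missing_plan_dirs` indicates that a plan directory is present for every rank.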

### Reproducing the Experiments of ZeRO-Offload and ZeRO-Infinity

The `megatron-ds` directory contains scripts to run DeepSpeed ZeRO-Offload and ZeRO-Infinity. **Note that** for ZeRO, the default NVMe path is `/dev/mdinf`, so please create a RAID and mount it there before running the scripts.

Please refer to `run_offload_8b.sh` and `run_inf_8b.sh` for Llama3-8B, to `run_offload_granite.sh` and `run_inf_granite.sh` for IBM Granite, and to `run_inf_70b.sh` for Llama3-70B.

### Traces and Data in Our Experiments

1. TeraIO.

   The tensor lifetime information is stored in `teraio-algorithm/data_${TEST_CONF_NAME}`. Because our experiments use 2 GPUs, the generated migration plans are located in `teraio-algorithm/results/torchtitan_0_${TEST_CONF_NAME}` and `teraio-algorithm/results/torchtitan_1_${TEST_CONF_NAME}`. Because the supplemental material has limited space, we include only a few traces.

   The traces collected during our experiments are stored in `torchtitan/logs`, while the logs for the ideal cases of different configurations are located in `torchtitan/logs_ideal`.

2. ZeRO-Offload and ZeRO-Infinity.

   The data we collected for ZeRO-Infinity and ZeRO-Offload is in `megatron-ds/numbers_inf_{model_name}` and `megatron-ds/numbers_offload_{model_name}`, respectively.
