# GriNNder: Large-Scale Full-Graph Training of Graph Neural Networks on a Single GPU with Storage

This code is the NeurIPS artifact of GriNNder.

Please note that while GPUDirect Storage (GDS) would further improve performance, the artifact does not use it by default, for ease of running. The optional GDS setup is described in the last section.

This artifact includes the code for the default run (e.g., GCN) for minimal reproducibility.
We will release the remaining code once the paper is accepted.


## liburing setting for fast I/O

```
sudo apt-get update
sudo apt-get install liburing-dev
# if you run the install script, tensornvme will be installed with liburing support
```
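To confirm that the development header is visible before building, a small check such as the following can help. It is only a sketch: `has_liburing_header` is a hypothetical helper name, and it assumes the package places `liburing.h` under `/usr/include`, which may differ on your distribution.

```shell
# Hypothetical helper: report whether liburing.h exists in the given include directory.
# Assumption: liburing-dev installs the header under /usr/include.
has_liburing_header() {
    include_dir="${1:-/usr/include}"
    if [ -f "$include_dir/liburing.h" ]; then
        echo "found"
    else
        echo "missing"
    fi
}

has_liburing_header   # checks /usr/include by default
```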

## Environment Setting (miniconda)

Python 3.10 / PyTorch 2.5.0 / CUDA 12.4 / DGL & PyG

### Anaconda
```
conda update -n base -c defaults conda
conda create -n grinnder python=3.10
conda activate grinnder
```

### Install Packages
```
chmod +x setup.sh
cd setups
chmod +x setup_*
cd ..
./setup.sh
```

The run script below runs the minimal experiments on the Products dataset :)
Before running it, please update `custom_info.txt` with your own paths.

```
CUSTOM_DIR=/home/{user}/GriNNder
DATASET_DIR=/home/{user}/datasets
CKPT_DIR=/home/{user}/ckpts
STORAGE_DIR=/home/{user}/storage
```
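As a convenience, the four entries can be generated from a single base directory. This is a sketch under assumptions: `write_custom_info` is a hypothetical helper (not part of the artifact), it assumes `custom_info.txt` holds exactly the `KEY=VALUE` lines shown above, and it creates the dataset, checkpoint, and storage directories up front in case the scripts expect them to exist.

```shell
# Hypothetical helper: write custom_info.txt-style entries for one base directory.
# $1 = base directory (e.g. /home/<user>), $2 = output file (e.g. custom_info.txt)
write_custom_info() {
    base_dir="$1"
    out_file="$2"
    # Assumption: creating these directories in advance is harmless for the artifact.
    mkdir -p "$base_dir/datasets" "$base_dir/ckpts" "$base_dir/storage"
    cat > "$out_file" <<EOF
CUSTOM_DIR=$base_dir/GriNNder
DATASET_DIR=$base_dir/datasets
CKPT_DIR=$base_dir/ckpts
STORAGE_DIR=$base_dir/storage
EOF
}

# Example (run from the repository root):
#   write_custom_info "$HOME" custom_info.txt
```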

When running the artifact, you will be asked to download the Products dataset.
Please answer `y` for it :)

### Running the Artifact
```
chmod +x artifact_run.sh
./artifact_run.sh
```

We also provide an automatic parser for the results.
After the artifact finishes running, run the parser as follows.

### Automatic Parsing of the Results
```
pip install prettytable # for printing a formatted table
python parse_artifact.py
```

### Expected (Example) Results

Even when HongTu can handle the full activations and gradients with snapshots, GriNNder still provides a speedup of roughly 1.4× to 1.5× over it in the example below.
Please note that we ran these example experiments on the OS SSD, not on a dedicated SSD as in the main evaluations in the paper.

```
+------------------------------------------------+
|       Products 3-Layer Summary (seconds)       |
+-----------------------+-------+--------+-------+
| Method / Partitioning | METIS | Random |  GRD  |
+-----------------------+-------+--------+-------+
|         HongTu        | 11.37 | 22.33  | 15.04 |
|          GRD          |  8.01 | 15.01  | 10.29 |
|        Speedup        | 1.42× | 1.49×  | 1.46× |
+-----------------------+-------+--------+-------+

+------------------------------------------------+
|       Products 5-Layer Summary (seconds)       |
+-----------------------+-------+--------+-------+
| Method / Partitioning | METIS | Random |  GRD  |
+-----------------------+-------+--------+-------+
|         HongTu        | 21.97 | 43.06  |  29.3 |
|          GRD          | 15.46 | 29.48  | 20.44 |
|        Speedup        | 1.42× | 1.46×  | 1.43× |
+-----------------------+-------+--------+-------+
```
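The Speedup row is the HongTu time divided by the GRD time for the same partitioning. A quick way to reproduce it from the numbers above is sketched here; `speedup` is a hypothetical helper, not part of `parse_artifact.py`.

```shell
# Hypothetical helper: speedup = baseline time / GRD time, rounded to two decimals.
speedup() {
    LC_ALL=C awk -v base="$1" -v ours="$2" 'BEGIN { printf "%.2f\n", base / ours }'
}

speedup 11.37 8.01    # Products 3-layer, METIS:  prints 1.42
speedup 43.06 29.48   # Products 5-layer, Random: prints 1.46
```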



### Optional: GDS Setting (not essential for running the artifact)

<details>
<summary> GDS-related setting details </summary>

The steps below follow https://docs.nvidia.com/gpudirect-storage/troubleshooting-guide/index.html#mofed-req-install
and https://docs.nvidia.com/gpudirect-storage/troubleshooting-guide/index.html#nvme-nvmeof-support

1. Disable IOMMU
    ```
    dmesg | grep -i iommu
    ```

    ```
    sudo vim /etc/default/grub
    ###############
    GRUB_CMDLINE_LINUX_DEFAULT="amd_iommu=off"
    # if other arguments already exist, separate them with spaces
    # (on Intel CPUs, use intel_iommu=off instead)
    ###############
    sudo update-grub
    sudo reboot
    ```

    ```
    cat /proc/cmdline # confirm that amd_iommu=off is present
    ```
2. Install Mellanox OFED

    Download: https://network.nvidia.com/products/infiniband-drivers/linux/mlnx_ofed/

    We used version 24.07.

    ```
    tar -zxvf MLNX_OFED_LINUX-24.07-0.6.1.0-ubuntu24.04-x86_64.tgz
    cd MLNX_OFED_LINUX-24.07-0.6.1.0-ubuntu24.04-x86_64
    sudo apt-get install bzip2
    sudo ./mlnxofedinstall --with-nvmf --with-nfsrdma --enable-gds --add-kernel-support --dkms
    sudo /etc/init.d/openibd restart
    sudo update-initramfs -u -k `uname -r`
    sudo reboot
    ```

3. Check whether the NVMe device is supported for GDS

    ```
    cat /sys/block/<nvme>/integrity/device_is_integrity_capable
    ```

4. XFS mount of a local filesystem for GDS

    As GDS only supports RAID 0, we need to set up a RAID 0 array with an XFS filesystem using mdadm.
    ```
    # mdadm install
    sudo apt-get update
    sudo apt-get install mdadm
    ```
    ```
    # create RAID0 array
    sudo mdadm --create --verbose /dev/md0 --level=0 --raid-devices=4 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme4n1
    # create XFS filesystem
    sudo mkfs.xfs /dev/md0
    ```
    ```
    # create mount point and mount the filesystem
    sudo mkdir -p /mnt/fast_nvme
    sudo mount /dev/md0 /mnt/fast_nvme
    ```
    ```
    # check the RAID array
    cat /proc/mdstat
    ```
    ```
    # add the RAID array to /etc/fstab for automatic mounting
    echo '/dev/md0 /mnt/fast_nvme xfs defaults 0 0' | sudo tee -a /etc/fstab
    ```
    ```
    # final check
    mount | grep xfs
    ```
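Before moving on to the GDS installation, the IOMMU setting from step 1 can be re-checked in a script. The sketch below uses a hypothetical helper, `iommu_disabled`, that looks for `amd_iommu=off` (or `intel_iommu=off`, the Intel counterpart, which is an addition here) as a whole argument on a kernel command line.

```shell
# Hypothetical helper: succeed if the given kernel command line disables the IOMMU.
# The surrounding spaces make sure only whole arguments match.
iommu_disabled() {
    case " $1 " in
        *" amd_iommu=off "*|*" intel_iommu=off "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Check the running kernel; prints the second message if /proc/cmdline is unavailable.
if iommu_disabled "$(cat /proc/cmdline 2>/dev/null)"; then
    echo "IOMMU is disabled on the kernel command line"
else
    echo "IOMMU is not disabled; revisit step 1"
fi
```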

From here, let's follow the NVIDIA installation guide: https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#install-gpudirect-storage

5. Install GDS
    
    Full GDS support is restricted to the following Linux distros: Ubuntu 20.04, Ubuntu 22.04, Ubuntu 24.04, RHEL 8.y (y <= 10), RHEL 9.y (y <= 4)

    Starting with CUDA toolkit 12.2.2, GDS kernel driver package nvidia-gds version 12.2.2-1 (provided by nvidia-fs-dkms 2.17.5-1) and above is only supported with the NVIDIA open kernel driver. Follow the instructions in Removing CUDA Toolkit and Driver to remove existing NVIDIA driver packages and then follow instructions in NVIDIA Open GPU Kernel Modules to install NVIDIA open kernel driver packages.

    ```
    sudo sh cuda_12.6.1_560.35.03_linux.run --kernel-module-type=open
    # note: select Kernel Objects (nvidia-fs) in the installer for GDS
    ```

    Please make sure that
    -   PATH includes /usr/local/cuda-12.6/bin
    -   LD_LIBRARY_PATH includes /usr/local/cuda-12.6/lib64, or, add /usr/local/cuda-12.6/lib64 to /etc/ld.so.conf and run ldconfig as root

    ```
    sudo modprobe nvidia_peermem
    echo "nvidia-peermem" | sudo tee /etc/modules-load.d/nvidia-peermem.conf
    ```

    ```
    # Now check
    python /usr/local/cuda-12.6/gds/tools/gdscheck.py -p
    ```

    ```
    GDS release version: 1.11.1.6
    nvidia_fs version:  2.22 libcufile version: 2.12
    Platform: x86_64
    ============
    ENVIRONMENT:
    ============
    =====================
    DRIVER CONFIGURATION:
    =====================
    NVMe               : Supported
    NVMeOF             : Unsupported
    SCSI               : Unsupported
    ScaleFlux CSD      : Unsupported
    NVMesh             : Unsupported
    DDN EXAScaler      : Unsupported
    IBM Spectrum Scale : Unsupported
    NFS                : Unsupported
    BeeGFS             : Unsupported
    WekaFS             : Unsupported
    Userspace RDMA     : Unsupported
    --Mellanox PeerDirect : Enabled
    --rdma library        : Not Loaded (libcufile_rdma.so)
    --rdma devices        : Not configured
    --rdma_device_status  : Up: 0 Down: 0
    =====================
    CUFILE CONFIGURATION:
    =====================
    properties.use_compat_mode : true
    properties.force_compat_mode : false
    properties.gds_rdma_write_support : true
    properties.use_poll_mode : false
    properties.poll_mode_max_size_kb : 4
    properties.max_batch_io_size : 128
    properties.max_batch_io_timeout_msecs : 5
    properties.max_direct_io_size_kb : 1024
    properties.max_device_cache_size_kb : 131072
    properties.max_device_pinned_mem_size_kb : 18014398509481980
    properties.posix_pool_slab_size_kb : 4 1024 16384  
    properties.posix_pool_slab_count : 128 64 32 
    properties.rdma_peer_affinity_policy : RoundRobin 
    properties.rdma_dynamic_routing : 0
    fs.generic.posix_unaligned_writes : false
    fs.lustre.posix_gds_min_kb: 0
    fs.beegfs.posix_gds_min_kb: 0
    fs.weka.rdma_write_support: false
    fs.gpfs.gds_write_support: false
    profile.nvtx : false
    profile.cufile_stats : 0
    miscellaneous.api_check_aggressive : false
    execution.max_io_threads : 0
    execution.max_io_queue_depth : 128
    execution.parallel_io : false
    execution.min_io_threshold_size_kb : 1024
    execution.max_request_parallelism : 0
    properties.force_odirect_mode : false
    properties.prefer_iouring : false
    =========
    GPU INFO:
    =========
    GPU index 0 NVIDIA RTX A5000 bar:1 bar size (MiB):32768 supports GDS, IOMMU State: Disabled
    ==============
    PLATFORM INFO:
    ==============
    IOMMU: disabled
    Nvidia Driver Info Status: Supported(Nvidia Open Driver Installed)
    Cuda Driver Version Installed:  12060
    Platform: System Product Name, Arch: x86_64(Linux 6.8.0-41-generic)
    Platform verification succeeded
    ```
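
When scripting this check, the two lines that matter most in the output above are the `NVMe` driver status and the final verification message. The sketch below reads `gdscheck.py` output on standard input; `gds_ready` is a hypothetical helper, and its patterns assume the output layout shown above.

```shell
# Hypothetical helper: read gdscheck output on stdin and succeed only if
# NVMe is reported as "Supported" and the platform verification succeeded.
gds_ready() {
    out="$(cat)"
    printf '%s\n' "$out" |
        grep -Eq '^[[:space:]]*NVMe[[:space:]]*:[[:space:]]*Supported[[:space:]]*$' &&
    printf '%s\n' "$out" | grep -q 'Platform verification succeeded'
}

# Example:
#   python /usr/local/cuda-12.6/gds/tools/gdscheck.py -p | gds_ready && echo "GDS ready"
```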
</details>

<details>
<summary>GDS config and test details</summary>

1. Result from the previous test

Running the previous check generates a `cufile.log` file.

You will likely see the following errors in it.
```
12-09-2024 04:21:19:129 [pid=4121196 tid=4121196] ERROR  0:140 unable to load,  liburcu-bp.so.6 
12-09-2024 04:21:19:129 [pid=4121196 tid=4121196] ERROR  0:140 unable to load,  liburcu-bp.so.1 
```

To resolve these errors, install the required packages.

```
# liburcu6 has to be installed manually
# because the system already ships liburcu8;
# download the liburcu6 package from the Ubuntu archive
wget http://archive.ubuntu.com/ubuntu/pool/main/libu/liburcu/liburcu6_0.11.1-2_amd64.deb
sudo apt install ./liburcu6_0.11.1-2_amd64.deb
sudo apt-get install liburcu-dev
```

Then the errors are resolved!
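
To confirm the fix, rerun the check and count the load errors in the resulting log. This is a minimal sketch: `count_load_errors` is a hypothetical helper, the log path is passed in because its location depends on where the check was run, and it may be worth renaming the old log first in case new entries are appended (an assumption, not something verified here).

```shell
# Hypothetical helper: print how many "unable to load" errors a cufile log contains.
count_load_errors() {
    # grep -c prints 0 but returns non-zero when nothing matches, so ignore its exit status.
    grep -c 'unable to load' "$1" || true
}

# Example:
#   count_load_errors cufile.log
```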

2. cufile setting

```
# copy setups/cufile.json into your home directory, then add the following line to ~/.bashrc
export CUFILE_ENV_PATH_JSON=~/cufile.json
# make sure that properties.allow_compat_mode is set to false in cufile.json
```
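
The compatibility-mode setting can be checked with a simple text search. The sketch below uses a hypothetical helper, `compat_mode_disabled`, and assumes the key appears as `"allow_compat_mode": false` on a single line; it skips lines that start with `//`, since the config file may contain such comments and would not parse as strict JSON.

```shell
# Hypothetical helper: succeed if the given cufile.json sets "allow_compat_mode" to false.
compat_mode_disabled() {
    # Drop //-style comment lines first so a commented-out setting is not counted.
    grep -Ev '^[[:space:]]*//' "$1" |
        grep -Eq '"allow_compat_mode"[[:space:]]*:[[:space:]]*false'
}

# Example:
#   compat_mode_disabled ~/cufile.json && echo "compat mode is off"
```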

</details>