Accelerate with CUDA-Enhanced Neuron and Layer-by-Layer Propagation
============================================================================================

Authors: `fangwei123456 <https://github.com/fangwei123456>`_

CUDA-Enhanced Neuron
-----------------------
:class:`spikingjelly.activation_based.neuron` provides the multi-step version of neurons. Compared with the single-step neuron,
the multi-step neuron can use cupy backend. The cupy backend fuses operations in a single cuda kernel, which is much faster
than naive pytorch backend. Let us run a simple experiment to compare LIF neurons in both module:

.. code-block:: python

    from spikingjelly.activation_based import neuron, surrogate, cuda_utils
    import torch


    def cal_forward_t(multi_step_neuron, x, repeat_times):
        with torch.no_grad():
            used_t = cuda_utils.cal_fun_t(repeat_times, x.device, multi_step_neuron, x)
            multi_step_neuron.reset()
            return used_t * 1000


    def forward_backward(multi_step_neuron, x):
        multi_step_neuron(x).sum().backward()
        multi_step_neuron.reset()
        x.grad.zero_()


    def cal_forward_backward_t(multi_step_neuron, x, repeat_times):
        x.requires_grad_(True)
        used_t = cuda_utils.cal_fun_t(repeat_times, x.device, forward_backward, multi_step_neuron, x)
        return used_t * 1000


    device = 'cuda:0'
    repeat_times = 1024
    ms_lif = neuron.MultiStepLIFNode(surrogate_function=surrogate.ATan(alpha=2.0))


    ms_lif.to(device)
    N = 2 ** 20
    print('forward')
    ms_lif.eval()
    for T in [8, 16, 32, 64, 128]:
        x = torch.rand(T, N, device=device)
        ms_lif.backend = 'torch'
        print(T, cal_forward_t(ms_lif, x, repeat_times), end=', ')
        ms_lif.backend = 'cupy'
        print(cal_forward_t(ms_lif, x, repeat_times))

    print('forward and backward')
    ms_lif.train()
    for T in [8, 16, 32, 64, 128]:
        x = torch.rand(T, N, device=device)
        ms_lif.backend = 'torch'
        print(T, cal_forward_backward_t(ms_lif, x, repeat_times), end=', ')
        ms_lif.backend = 'cupy'
        print(cal_forward_backward_t(ms_lif, x, repeat_times))

The code is running at a Ubuntu server with `Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz` CPU and `GeForce RTX 2080 Ti` GPU. The outputs are:

.. code-block:: bash

    forward
    8 1.9180845527841939, 0.8166529733273364
    16 3.8143536958727964, 1.6002442711169351
    32 7.6071328955436, 3.2570467449772877
    64 15.181676714490777, 6.82808195671214
    128 30.344632044631226, 14.053565065751172
    forward and backward
    8 8.131792200288146, 1.6501817200662572
    16 21.89934094545265, 3.210343387223702
    32 66.34630815216269, 6.41730432241161
    64 226.20835550819152, 13.073845567419085
    128 827.6064751953811, 26.71502177403795

We plot the results in a bar chart:

.. image:: ../_static/tutorials/activation_based/11_cext_neuron_with_lbl/exe_time_f.*
    :width: 100%

.. image:: ../_static/tutorials/activation_based/11_cext_neuron_with_lbl/exe_time_fb.*
    :width: 100%

It can be found that cupy backend is much faster than naive pytorch backend.

Accelerate Deep SNNs
-----------------------
Now let us use the CUDA-Enhanced Multi-Step neuron to re-implement the network in :doc:`../activation_based_en/4_conv_fashion_mnist` and compare their speeds. There is no need to modify the training codes. We can only change the network's codes:

.. code-block:: python

    class CupyNet(nn.Module):
        def __init__(self, T):
            super().__init__()
            self.T = T

            self.static_conv = nn.Sequential(
                nn.Conv2d(1, 128, kernel_size=3, padding=1, bias=False),
                nn.BatchNorm2d(128),
            )

            self.conv = nn.Sequential(
                neuron.MultiStepIFNode(surrogate_function=surrogate.ATan(), backend='cupy'),
                layer.SeqToANNContainer(
                        nn.MaxPool2d(2, 2),  # 14 * 14
                        nn.Conv2d(128, 128, kernel_size=3, padding=1, bias=False),
                        nn.BatchNorm2d(128),
                ),
                neuron.MultiStepIFNode(surrogate_function=surrogate.ATan(), backend='cupy'),
                layer.SeqToANNContainer(
                    nn.MaxPool2d(2, 2),  # 7 * 7
                    nn.Flatten(),
                ),
            )
            self.fc = nn.Sequential(
                layer.SeqToANNContainer(nn.Linear(128 * 7 * 7, 128 * 4 * 4, bias=False)),
                neuron.MultiStepIFNode(surrogate_function=surrogate.ATan(), backend='cupy'),
                layer.SeqToANNContainer(nn.Linear(128 * 4 * 4, 10, bias=False)),
                neuron.MultiStepIFNode(surrogate_function=surrogate.ATan(), backend='cupy'),
            )


        def forward(self, x):
            x_seq = self.static_conv(x).unsqueeze(0).repeat(self.T, 1, 1, 1, 1)
            # [N, C, H, W] -> [1, N, C, H, W] -> [T, N, C, H, W]

            return self.fc(self.conv(x_seq)).mean(0)

The fully codes are available at :class:`spikingjelly.activation_based.examples.conv_fashion_mnist`. Run this example with the same arguments and devices as those in :doc:`../activation_based_en/4_conv_fashion_mnist`. The outputs are:

.. code-block:: shell

    (pytorch-env) root@e8b6e4800dae4011eb0918702bd7ddedd51c-fangw1598-0:/# python -m spikingjelly.activation_based.examples.conv_fashion_mnist -opt SGD -data_dir /userhome/datasets/FashionMNIST/ -amp -cupy

    Namespace(T=4, T_max=64, amp=True, b=128, cupy=True, data_dir='/userhome/datasets/FashionMNIST/', device='cuda:0', epochs=64, gamma=0.1, j=4, lr=0.1, lr_scheduler='CosALR', momentum=0.9, opt='SGD', out_dir='./logs', resume=None, step_size=32)
    CupyNet(
      (static_conv): Sequential(
        (0): Conv2d(1, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
      (conv): Sequential(
        (0): MultiStepIFNode(
          v_threshold=1.0, v_reset=0.0, detach_reset=False
          (surrogate_function): ATan(alpha=2.0, spiking=True)
        )
        (1): SeqToANNContainer(
          (module): Sequential(
            (0): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
            (1): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
            (2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          )
        )
        (2): MultiStepIFNode(
          v_threshold=1.0, v_reset=0.0, detach_reset=False
          (surrogate_function): ATan(alpha=2.0, spiking=True)
        )
        (3): SeqToANNContainer(
          (module): Sequential(
            (0): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
            (1): Flatten(start_dim=1, end_dim=-1)
          )
        )
      )
      (fc): Sequential(
        (0): SeqToANNContainer(
          (module): Linear(in_features=6272, out_features=2048, bias=False)
        )
        (1): MultiStepIFNode(
          v_threshold=1.0, v_reset=0.0, detach_reset=False
          (surrogate_function): ATan(alpha=2.0, spiking=True)
        )
        (2): SeqToANNContainer(
          (module): Linear(in_features=2048, out_features=10, bias=False)
        )
        (3): MultiStepIFNode(
          v_threshold=1.0, v_reset=0.0, detach_reset=False
          (surrogate_function): ATan(alpha=2.0, spiking=True)
        )
      )
    )
    Mkdir ./logs/T_4_b_128_SGD_lr_0.1_CosALR_64_amp_cupy.
    Namespace(T=4, T_max=64, amp=True, b=128, cupy=True, data_dir='/userhome/datasets/FashionMNIST/', device='cuda:0', epochs=64, gamma=0.1, j=4, lr=0.1, lr_scheduler='CosALR', momentum=0.9, opt='SGD', out_dir='./logs', resume=None, step_size=32)
    ./logs/T_4_b_128_SGD_lr_0.1_CosALR_64_amp_cupy
    epoch=0, train_loss=0.028574782584865507, train_acc=0.8175080128205128, test_loss=0.020883125430345536, test_acc=0.8725, max_test_acc=0.8725, total_time=13.037598133087158
    Namespace(T=4, T_max=64, amp=True, b=128, cupy=True, data_dir='/userhome/datasets/FashionMNIST/', device='cuda:0', epochs=64, gamma=0.1, j=4, lr=0.1, lr_scheduler='CosALR', momentum=0.9, opt='SGD', out_dir='./logs', resume=None, step_size=32)
    ./logs/T_4_b_128_SGD_lr_0.1_CosALR_64_amp_cupy

    ...

    epoch=62, train_loss=0.001055751721853287, train_acc=0.9977463942307693, test_loss=0.010815625159442425, test_acc=0.934, max_test_acc=0.9346, total_time=11.059867858886719
    Namespace(T=4, T_max=64, amp=True, b=128, cupy=True, data_dir='/userhome/datasets/FashionMNIST/', device='cuda:0', epochs=64, gamma=0.1, j=4, lr=0.1, lr_scheduler='CosALR', momentum=0.9, opt='SGD', out_dir='./logs', resume=None, step_size=32)
    ./logs/T_4_b_128_SGD_lr_0.1_CosALR_64_amp_cupy
    epoch=63, train_loss=0.0010632637413514631, train_acc=0.9980134882478633, test_loss=0.010720000202953816, test_acc=0.9324, max_test_acc=0.9346, total_time=11.128222703933716

We get 93.46% accuracy, which is very close to 93.3% in :doc:`../activation_based/11_cext_neuron_with_lbl`. Here are training logs:

.. image:: ../_static/tutorials/activation_based/11_cext_neuron_with_lbl/train.*
    :width: 100%

.. image:: ../_static/tutorials/activation_based/11_cext_neuron_with_lbl/test.*
    :width: 100%

In fact, we set an identical seed in both examples, but get a different results, which maybe caused by the numerical errors between cupy and pytorch functions. It can be found that the training execution time with cupy backend is 69% of the naive PyTorch SNN.