\begin{abstract}
Spiking neural networks (SNNs) are becoming increasingly popular for their low energy requirements in real-world tasks, with accuracy comparable to traditional ANNs. Because the Heaviside spiking function is non-differentiable and its derivative vanishes almost everywhere, SNN training algorithms lose gradient information when minimizing the model loss over the model parameters. To circumvent this problem, the surrogate method employs a differentiable approximation of the Heaviside function in the backward pass, while the forward pass continues to use the Heaviside as the spiking function. We propose to use the zeroth-order technique at the local or neuron level in training SNNs, motivated by its regularizing and potentially energy-efficient effects, and establish a theoretical connection between it and the existing surrogate methods. We perform experimental validation of the technique on standard static datasets (CIFAR-10, CIFAR-100, ImageNet-100) and neuromorphic datasets (DVS-CIFAR-10, DVS-Gesture, N-Caltech-101, N-CARS) and obtain results that improve over the state of the art. The proposed method also lends itself to efficient implementations of the back-propagation method, which could provide a 3--4 times overall speedup in training time. The code is available at \url{https://github.com/BhaskarMukhoty/LocalZO}.
\end{abstract}

\section{Introduction}
Biological neural networks are known to be significantly more energy efficient than their artificial counterparts, the artificial neural networks (ANNs). Unlike ANNs, biological neurons use spike trains to communicate and process information asynchronously \cite{mainen1995reliability}. To closely emulate biological neurons, spiking neural networks (SNNs) use binary activation to send information to the neighbouring neurons when the membrane potential exceeds the membrane threshold. The event-driven binary activation simplifies the accumulation of input potential and reduces the computational burden when the spikes are sparse. Specialized neuromorphic hardware \cite{davies2018loihi} is designed to carry out such event-driven and sparse computations in an energy-efficient way \cite{pfeiffer2018deep, kim2020spiking}.

There are broadly three categories of SNN training methods: ANN-to-SNN conversion, unsupervised, and supervised. The first is based on the principle that the parameters of an SNN can be inferred from the corresponding ANN architecture \cite{cao2015spiking,diehl2015fast,bu2021optimal}. Although training SNNs through this method achieves performance comparable to ANNs, it suffers from the long latency needed in SNNs to emulate the corresponding ANN, or from the retraining of ANNs required to achieve near-lossless conversion \cite{davidson2021comparison}. Unsupervised training is biologically inspired and uses local learning to adjust the SNN parameters \cite{diehl2015unsupervised}. Although it is the most energy-efficient of the three, as it is implementable on neuromorphic chips \cite{davies2018loihi}, it still lags in performance compared to ANN-to-SNN conversion and supervised training.

Finally, supervised training is a method of direct training of SNNs by using back-propagation (through time). As such, it faces two main challenges. The first is due to the nature of SNNs, or more precisely, due to the Heaviside activation of neurons (applied to the difference between the membrane potential and threshold). As the derivative of the Heaviside function is zero, except at zero where it is not defined, back-propagation does not convey any information for the SNN to learn \cite{eshraghian2021training}. One of the most popular ways to circumvent this drawback is to use surrogate methods, where the derivative of a surrogate function is used in the backward pass during training. Due to their simplicity, surrogate methods have been widely used and have seen tremendous success in various supervised learning tasks \cite{shrestha2018slayer, neftci2019surrogate}. However, large and complex network architectures, the time-recursive nature of SNNs, and the fact that the training is oblivious to the sparsity of spikes in SNNs make surrogate methods quite time- and energy-consuming.

Regarding regularization or energy efficiency during direct training of SNNs, only a few methods have been proposed addressing these topics together or separately, most of which deal with the forward propagation in SNNs. For example, \cite{alawad2017stochastic} uses stochastic neurons to increase energy efficiency during inference. More recently, \cite{yan2022sparsereg} uses regularization during the training to increase the sparsity of spikes, reducing the computational burden and energy consumption. Further, \cite{cramer2022surrogate} performs the forward pass on a neuromorphic chip, while the backward pass is performed on a standard GPU. Although these methods improve the SNN models' performance, they do not significantly reduce the computational burden or provide the potential to do so. On the other hand, \cite{nieves2021sparse} introduces a threshold for surrogate gradients (or suggests using only a surrogate with bounded support). However, introducing gradient thresholds has the drawback of limiting the full potential of surrogates during training.


This paper proposes a direct training method for SNNs based on the zeroth-order technique. We apply it locally, at the neuronal level, and hence dub it Local Zeroth Order (\lzo). The benefits are twofold: regularization, which comes as a side-effect of the randomness that is naturally associated with this technique, and a threshold for gradient backpropagation in the style of \cite{nieves2021sparse}, which translates to potentially energy-efficient training when properly implemented.

We summarize the main contributions of the paper as follows: 
\begin{itemize}
\item We introduce zeroth-order techniques in SNN training at a local level. We establish theoretical properties of the method, relating it to the surrogate gradients via the internal distributions used in \lzo.
\item We experimentally demonstrate the main properties of \lzo: its superior generalization performance compared to baselines, its ability to simulate arbitrary surrogates, and its ability to speed up the training process, which translates to energy-efficient training.
\end{itemize}


% In this paper, we propose a direct training method for SNNs which encompasses the full power of surrogate gradients in an energy efficient way. Based on zeroth order techniques and applied locally at neuronal level - hence dubbed Local Zeroth Order (\lzo) - our method is able to simulate arbitrary surrogate functions during the training, and at the same time significantly reduce the number of computational steps in the backward pass, which directly translates to energy saving.




\section{Background}

\subsection{Spiking neuron dynamics}
An SNN consists of Leaky Integrate-and-Fire (LIF) neurons governed by differential equations in continuous time \cite{gerstner2014neuronal}. They are generally approximated by discrete dynamics given in the form of recurrence equations,
\begin{align}
	u^{(l)}_{i}[t] &= \beta u^{(l)}_{i}[t-1] + \sum_{j} w_{ij} x^{(l-1)}_{j}[t] - x^{(l)}_{i}[t-1] u_{th},\nonumber\\
	x^{(l)}_{i}[t] &= h(u^{(l)}_{i}[t] - u_{th} ) = \begin{cases} 1 & \text{if } u^{(l)}_{i}[t] > u_{th} \\
		0 & \text{otherwise,}	
	\end{cases}
 \label{eq:lif_discrete}
\end{align}
where $u^{(l)}_{i}[t]$ denotes the membrane potential of the $i$-th neuron in layer $l$ at the (discrete) time-step $t$, which recurrently depends upon its previous potential (with scaling factor $\beta < 1$) and the spikes $x^{(l-1)}_{j}[t]$ received from the neurons of the previous layer, weighted by $w_{ij}$. The neuron generates a binary spike $x^{(l)}_{i}[t]$ whenever the membrane potential exceeds the threshold $u_{th}$, represented by the Heaviside function $h$, followed by a reset effect on the membrane potential.

To implement the back-propagation of training loss through the network, one must obtain a derivative of the spike function, which poses a significant challenge in its original form represented as:
\begin{align}
    \frac{dx_i[t]}{du} = \begin{cases}
	\infty & \text{if } u_i^{(l)}[t]= u_{th}\\
	0 & \text{otherwise,}
\end{cases}
\end{align}
where we denote $u:= u_i^{(l)}[t] - u_{th}$. To avoid the entire gradient becoming zero, known as the dead neuron problem, the surrogate gradient method (referred to as \sur) redefines the derivative using a surrogate:   
\begin{align}
    \frac{dx_i[t]}{du}:=g(u)
\end{align}
Here, the function $g(u)$ can be, for example, the derivative of the Sigmoid function (see Section \ref{sec:surr_to_dist}), but in general, one takes a scaled probability density function as a surrogate (see Section \ref{sec surrogate functions} for more details).

\subsection{Motivation}\label{sec motivation}
Classically, the purpose of dropout in ANNs is to prevent a complex and powerful network from over-fitting the training data, which consequently implies better generalization properties \cite{baldi2013understanding}. One usually assigns to each neuron in a targeted layer a fixed probability of being ``switched off'' during both the forward and backward passes, and this probability does not change during training. Moreover, the ``activity'' of the neuron, however we may define it, does not affect whether the neuron will be switched on or off.

Our motivation runs along these lines: how can we introduce a dropout-like regularizing effect in the training of SNNs while taking into account the temporal dimension of the data as well as the neuron activity at that particular moment? Heuristically, a more active neuron would be kept ``on'' with high probability, while a less active one would be ``switched off'', again with high probability, in a sense to be made precise shortly. Generally speaking, our idea consists of the following two steps. 1) For each spiking neuron of the SNN, measure how active the neuron is in the forward pass at each time step $t$. Here, we define the activity based on how far the current membrane potential of the neuron $u[t]$ is from the firing threshold $u_{th}$ (this idea comes from \cite{nieves2021sparse}). However, unlike in \cite{nieves2021sparse}, where the distance alone is the determining factor, we introduce randomness via a fixed PDF, say $\lambda$: we sample $z$ from it and say the neuron is active at time $t$ if $|u[t]-u_{th}|<c|z|$, where $c$ is a constant fixed upfront. 2) In the backward pass at time $t$, if the neuron is deemed active, we apply some surrogate function $g(u[t]-u_{th})$; otherwise, we take the surrogate to be 0, hence switching off the propagation of gradients through the neuron.

Having said this, we ask ourselves the final question: can we have a systematic way of choosing functions $\lambda$ and $g$ so that the expected surrogate we use (with respect to $\lambda$) equals the one we chose upfront? A simple yet elegant solution that satisfies all of the above comes with zeroth order methods.

The zeroth-order technique is a popular gradient-free method \cite{liu2020primer}, well studied in the neural network literature. To briefly introduce it, we consider a function $f: \mathbb{R}^d \rightarrow \mathbb{R}$ that we intend to minimize using gradient descent, for which the gradient may be unavailable or even undefined. The zeroth-order method estimates the gradients using function outputs: given a scalar $\delta > 0$, the 2-point ZO estimate is defined as
\begin{align}
G^{2}(\m w; \m z, \delta) = \phi(d) \frac{f(\m w+\delta \m z )-f(\m w-\delta \m z )}{2\delta} \m z    
\end{align}
where $\m z \sim \lambda$ is a random direction with $\E{\norm{\m z}^2}{z \sim \lambda}=1$ and $\phi(d)$ is a dimension-dependent factor, with $d$ being the dimension. However, to approximate the full gradient of $f$ up to a constant squared error, we need an average of $O(d)$ samples of $G^2$, which becomes computationally challenging when $d$ is large, such as the number of learnable parameters of a neural network. Though well studied in the literature, properties of the 2-point ZO are known only for continuous functions \cite{nesterov2017random, berahas2022theoretical}. In the present context, we apply it locally to the Heaviside function that produces the outputs of spiking neurons, and we justify this by providing the necessary theoretical background for doing so.
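
As a sanity check on this definition, the following one-dimensional sketch (our own illustration, assuming $d=1$, $\phi(1)=1$, a three times continuously differentiable $f$ with bounded third derivative, and a finite fourth moment of $\lambda$) shows why the estimator recovers the derivative in expectation. A Taylor expansion around $w$ gives
\begin{align*}
\frac{f(w+\delta z)-f(w-\delta z)}{2\delta}\, z &= f'(w)\, z^2 + O(\delta^2 z^4), \\
\E{G^{2}(w; z, \delta)}{z \sim \lambda} &= f'(w)\,\E{z^2}{z \sim \lambda} + O(\delta^2) = f'(w) + O(\delta^2),
\end{align*}
where the last equality uses the normalization $\E{z^2}{z \sim \lambda}=1$. The expansion relies on the smoothness of $f$, which is exactly what fails for the Heaviside function.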


\section{The \lzo algorithm}
Applying ZO on a global scale is challenging due to the large dimensionality of neural networks~\cite{li2021rethinking}. Since the non-differentiability of SNNs is introduced by the Heaviside function at the neuronal level, we apply the 2-point ZO method on $h: \real{} \rightarrow \{0,1\}$ itself,
\begin{align}
\label{eq:lzo}
    G^{2}( u; z, \delta) &= \frac{h(u +z\delta )-h(u-z\delta )}{2\delta}z = \begin{cases}
	0, & \abs{u} > \abs{z}\delta\\
	\frac{\abs{z}}{2 \delta}, & \abs{u} < \abs{z}\delta\\
\end{cases}
\end{align}

where $u= u_i^{(l)}[t] - u_{th}$ and $z$ is sampled from some distribution $\lambda$. We may average the 2-point ZO gradient over a few samples $z_k$ so that the \lzo derivative of the spike function is defined as:
\begin{equation}
    \frac{dx_i[t]}{du}:=\frac{1}{m}\sum_{k=1}^{m} G^{2}(u; z_k, \delta)
    \label{eqn:lzo_sum}
\end{equation}
where the number of samples, $m$, is a hyper-parameter of the \lzo method. We implement this at the neuronal level of the back-propagation routine, where the forward pass uses the Heaviside function, and the backward pass uses equation \eqref{eqn:lzo_sum}. Note that the gradient $\frac{dx_i[t]}{du}$ being non-zero naturally determines the active neurons of the backward pass (as was discussed in Section \ref{sec motivation}), which can be inferred from the forward pass through the neuron. Algorithm \ref{algo:lzo} gives an abstract representation of the process at a neuronal level, which hints that the backward call is redundant when the neuron has a zero gradient.
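
The closed form in \eqref{eq:lzo} can be checked by a short case analysis. The sketch below assumes the convention $h(x)=1$ for $x>0$ and $h(x)=0$ otherwise, as in \eqref{eq:lif_discrete}:
\begin{align*}
z>0:&\quad h(u+z\delta)-h(u-z\delta) = \begin{cases} 1, & -z\delta < u \leq z\delta\\ 0, & \text{otherwise,}\end{cases}\\
z<0:&\quad h(u+z\delta)-h(u-z\delta) = \begin{cases} -1, & -\abs{z}\delta < u \leq \abs{z}\delta\\ 0, & \text{otherwise.}\end{cases}
\end{align*}
Multiplying by $\frac{z}{2\delta}$ gives $\frac{\abs{z}}{2\delta}$ on the interval in both cases, since the sign of $z$ cancels the sign of the difference. This agrees with \eqref{eq:lzo} up to the boundary points $\abs{u}=\abs{z}\delta$, which occur with probability zero whenever $\lambda$ has a density.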

%At each layer we sample $z_k$ for $mBNT$ times, where $B, N,$ and $T$ denote the batch size, number of neurons in the layer, and latency of the network, and store the required gradient information for the backward pass for only the active neurons.    
%Thus, while all the three methods follow the same computation process of layer-wise back-propagation, only \lzo and \spgd gets the benefits of computational saving due to inactive neurons. 

\begin{wrapfigure}[13]{r}{0.5\textwidth }
\vspace{-8mm}
\begin{minipage}{0.5\textwidth}
\begin{algorithm}[H]
	\caption{\lzo}
	\label{algo:lzo}
	\begin{algorithmic}
	{\small
            \STATE \hspace{-3.5mm} \textbf{Forward} %
            \REQUIRE $u := u_i^{(l)}[t] - u_{th}$, dist. $\lambda$, const. $\delta, m$
            \STATE sample $z_1, z_2, \ldots, z_m \sim \lambda$
            \STATE $grad \leftarrow \frac{1}{m}\sum_{k=1}^{m}\mathbb{I}(\abs{u}< \delta \abs{z_k}) \frac{\abs{z_k}}{2\delta}$
            \IF {$grad \neq 0$} 
                \STATE SaveForBackward($grad$)
            \ENDIF
            \STATE \textbf{return} $\mathbb{I}(u> 0)$
            \vspace{1mm}
            \\\hrule
            \vspace{1mm}
            \STATE \hspace{-3.5mm}\textbf{Backward} \, \COMMENT{Invoked if grad is non-zero}
            \REQUIRE gradient from chain rule: $grad\_input$
            \STATE \textbf{return} $grad\_input * grad$ 
            %\STATE 
	}
\end{algorithmic}
\end{algorithm}
\end{minipage}
\end{wrapfigure}

\section{Theoretical Properties of \lzo} % 
\subsection{General ZO function}
For the theoretical results around \lzo, we consider a more general function than the one suggested by \eqref{eq:lzo}, of the form
\begin{equation}\label{eq: zo_general}
G^2(u;z,\delta)=\begin{cases}
    0, \quad |u|> |z|\delta\\
    \frac{|z|^\alpha}{2\delta},\quad |u|\leq |z|\delta,
\end{cases}
\end{equation}
where the new constant $\alpha$ is an integer different from 0, while $\delta$ is a positive real number (so, for example, setting $\alpha=1$ in \eqref{eq: zo_general}, we obtain \eqref{eq:lzo}).

The integer $\alpha$ acts as a normalizing constant, which allows obtaining different surrogates as the expectation of the function $G^2(u; z,\delta)$ when $z$ is sampled from suitable distributions. In practice, taking $\alpha=\pm 1$ suffices to account for most of the surrogates found in the literature. The role of $\delta$ is somewhat different, as it controls the ``shape'' of the surrogate (narrowing or stretching it around zero). The role of each constant will become more evident from what follows (see Section \ref{sec:surr_to_dist}).

\subsection{Surrogate functions} \label{sec surrogate functions}
\begin{defn}
\label{def:surr}
We say that a function $g:\R\to \R_{\geq 0}$ is a surrogate function (gradient surrogate) if it is even, non-decreasing on the interval $(-\infty,0)$ and $c:=\int_{-\infty}^\infty g(z)dz<\infty$. 
\end{defn}
Note that the integral $\int_{-\infty}^\infty g(z)dz$ is well defined (as $g(z)$ is non-negative) but can possibly be $\infty$; the last condition excludes this case and means that the function $\frac{1}{c}g(t)$ is a probability density function. The first two conditions, that is, the requirements for the function to be even and non-decreasing on $(-\infty,0)$, are not essential but rather practical and consistent with examples from the SNN literature.

Note that the function $G:\R\to [0,1]$, defined as $G(t):=\frac{1}{c}\int_{-\infty}^t g(z)dz$ is the corresponding cumulative distribution function (for PDF $\frac{1}{c}g(t)$). Moreover, it is not difficult to see that its graph is ``symmetric'' around point $(0,\frac{1}{2})$ (or in more precise terms, $G(t)=1-G(-t)$), hence $G(t)$ can be seen as an approximation of Heaviside function $h(t)$. Then, its derivative $\frac{d}{dt}G(t)=\frac{1}{c}g(t)$ can serve as an approximation of the ``derivative'' of $h(t)$, or in other words, as its surrogate, which somewhat justifies the terminology. 

Finally, one may note that ``true'' surrogates would correspond to those functions $g$ for which $c=1$. However, the reason we allow $c$ to be different from 1 is again practical and simplifies the derivation of the results that follow. We note once again that allowing a general $c$ is consistent with examples used in the literature.

\subsection{Surrogates and ZO}
To be in line with classic results around the ZO method and gradient approximation of functions, we pose ourselves two fundamental questions: What sort of functions in variable $u$ can be obtained as the expectation of $G^2(u;z,\delta)$ when $z$ is sampled from a suitable distribution $\lambda$, and, given some function $g(u)$, can we find a distribution $\lambda$ such that we obtain $g(u)$ in the expectation when $z$ is sampled from $\lambda$? 

Two theorems that follow answer these questions and are the core of this section. The main player in both of the questions is the expected value of $G^2(u;z,\delta)$, so we start by analyzing it more precisely. Let $\lambda$ be a distribution, $\lambda(t)$ its PDF for which we assume that it is even and that $\int_0^{\infty}z^\alpha\lambda(z)dz<\infty$. Then, we may write
\begin{align}
\E{&G^{2}(u; z, \delta)}{z \sim \lambda }=\int\limits_{-\infty}^{\infty}G^{2}(u; z, \delta)\lambda(z)dz = \int\limits_{|u|\leq |z|\delta}\frac{|z|^\alpha}{2\delta}\lambda(z)dz  =\frac{1}{\delta}\int\limits_{\frac{|u|}{\delta}}^\infty z^\alpha \lambda(z) dz. \label{eq: expected zo}
\end{align}
It becomes apparent from \eqref{eq: expected zo} that $\E{G^{2}(u; z, \delta)}{z \sim \lambda }$ has some properties of surrogate functions (it is even and non-decreasing on $\R_{<0}$). The proofs of the following results are detailed in Appendix \ref{app:proof}.
\begin{restatable}{lem}{lemmaone}
    Assume further that $\int_{0}^{\infty}z^{\alpha+1}\lambda(z)dz<\infty$. Then, $\E{G^{2}(u; z, \delta)}{z \sim \lambda }$ is a surrogate function.
\end{restatable}

\begin{restatable}{thm}{thmone}
\label{thm: expectation}
Let $\lambda$ be a distribution and $\lambda(t)$ its corresponding PDF. Assume that integrals $\int_0^\infty t^\alpha \lambda(t)dt$ and $\int_0^\infty t^{\alpha+1}\lambda(t)dt$ exist and are finite. Let further $\tilde{\lambda}$ be the distribution with corresponding PDF function
$$
\Tilde{\lambda}(z) = \frac{1}{c} \int\limits_{|z|}^\infty t^\alpha \lambda(t) dt,
$$
where $c$ is the scaling constant (such that $\int_{-\infty}^\infty \tilde{\lambda}(z)dz=1$). 
Then,
$$
\E{G^{2}(u; z, \delta)}{z \sim \lambda }=\frac{d}{du} \E{c\, h(u+\delta z)}{z \sim \tilde{\lambda} }.
$$
\end{restatable}
For our next result, which answers the second question we asked at the beginning of this section, note that a surrogate function is differentiable almost everywhere, which follows from the Lebesgue theorem on the differentiability of monotone functions. So, taking derivatives here is understood in an ``almost everywhere'' sense.

\begin{restatable}{thm}{thmtwo}
\label{thm: main2}
    Let $g(u)$ be a surrogate function. Suppose further that $c=-2\delta^2\int_{0}^\infty \frac{1}{z^\alpha}g'(z\delta)dz <\infty$ and put $\lambda(z) = -\frac{\delta^2}{c z^\alpha}g'(z\delta)$ (so that $\lambda(z)$ is a PDF). Then,
    $$c\,\E{G^2(u;z,\delta)}{z\sim \lambda} = \E{c\,G^2(u;z,\delta)}{z\sim \lambda}  = g(u).
    $$
\end{restatable}
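
A quick way to see why this choice of $\lambda$ works is to substitute it into \eqref{eq: expected zo}. The following is only a sketch, assuming that $g$ is absolutely continuous on $(0,\infty)$ and that $g(s)\to 0$ as $s\to\infty$; the rigorous argument is the one given in the appendix:
\begin{align*}
c\,\E{G^2(u;z,\delta)}{z\sim \lambda} = \frac{c}{\delta}\int\limits_{\frac{\abs{u}}{\delta}}^\infty z^\alpha \lambda(z)\, dz = -\delta\int\limits_{\frac{\abs{u}}{\delta}}^\infty g'(z\delta)\, dz = -\int\limits_{\abs{u}}^\infty g'(s)\, ds = g(\abs{u}) = g(u).
\end{align*}
The factor $z^{-\alpha}$ in $\lambda$ cancels the weight $z^\alpha$ of the estimator, the third equality is the substitution $s=z\delta$, and the last step uses that $g$ is even.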

\subsection{Application of Theorems \ref{thm: expectation} and \ref{thm: main2}} 
\label{sec:dist_to_sur}
\label{sec:surr_to_dist}
Next, we spell out the results of Theorem \ref{thm: expectation} applied to some standard distributions, with $\alpha=1$. For clarity, all the distributions' parameters are chosen so that the scaling constant of the resulting surrogate is 1. One may consult Figure \ref{fig:lzo_distribution} for the visual representation of the results, while the details are provided in Appendix \ref{app:dist_to_surr}. Recall that the standard normal distribution $N(0,1)$ has a PDF of the form $\frac{1}{\sqrt{2\pi}}\exp(-\frac{z^2}{2})$. Consequently, from \eqref{eq: expected zo} it is straightforward to obtain
\begin{align}
\E{G^{2}(u; z, \delta)}{z \sim \lambda } &=  \frac{1}{\sqrt{2 \pi}}\int\limits_{\abs{u}\leq \abs{z}\delta} \frac{\abs{z}}{2 \delta} \exp(-\frac{z^2}{2}) dz = \frac{1}{\delta \sqrt{2\pi}}\int\limits_{\frac{\abs{u}}{\delta}}^{\infty} z \exp(-\frac{z^2}{2}) dz =\frac{1}{\delta \sqrt{2\pi}} \exp(-\frac{u^2}{2\delta^2}).
\end{align} 
In Appendix \ref{app:dist_to_surr}, we further derive the surrogates when $z$ is sampled from the Uniform and Laplace distributions.
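
To illustrate the same recipe on a distribution with bounded support, consider $z \sim$ Unif$([-\sqrt{3},\sqrt{3}])$, whose PDF is $\frac{1}{2\sqrt{3}}$ on $[-\sqrt{3},\sqrt{3}]$ and which has unit variance. The computation below is our own sketch obtained from \eqref{eq: expected zo} with $\alpha=1$; the derivation in the appendix remains the reference. For $\abs{u}\leq \sqrt{3}\delta$,
\begin{align*}
\E{G^{2}(u; z, \delta)}{z \sim \lambda } = \frac{1}{\delta}\int\limits_{\frac{\abs{u}}{\delta}}^{\sqrt{3}} \frac{z}{2\sqrt{3}}\, dz = \frac{1}{4\sqrt{3}\,\delta}\left(3-\frac{u^2}{\delta^2}\right),
\end{align*}
and the expectation is $0$ for $\abs{u}>\sqrt{3}\delta$. Integrating over $u\in[-\sqrt{3}\delta,\sqrt{3}\delta]$ gives $\frac{1}{4\sqrt{3}\,\delta}\left(6\sqrt{3}\delta-2\sqrt{3}\delta\right)=1$, so the scaling constant is indeed 1. The resulting surrogate has bounded support, so no gradient is propagated once $\abs{u}$ exceeds $\sqrt{3}\delta$.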

We recall that Theorem \ref{thm: main2} provides a way to derive distributions for arbitrary surrogate functions (that satisfy the conditions of the theorem). Consider the Sigmoid surrogate function, where the differentiable Sigmoid function approximates the Heaviside \cite{zenke2018superspike}. The corresponding surrogate gradient is given by,
\begin{align*}
   \frac{dx}{du} = \frac{d}{du}\frac{1}{1 + \exp(-k u)} = \frac{k \exp(-k u)}{(1 + \exp(-ku))^2} =: g(u)
\end{align*}
%and, $$g'(u) = -\frac{k^2 \exp(-ku)(1-\exp(-ku))}{(1+\exp(-ku))^3}$$
Observe that $g(u)$ satisfies our definition of a surrogate ($g(u)$ being even, non-decreasing on $(-\infty,0)$ and 
$\int_{-\infty}^{\infty} g(u) du = 1<\infty$). Thus, according to Theorem \ref{thm: main2}, we have
$c=-2\delta^2\int_{0}^\infty \frac{g'(t\delta)}{t}dt= \frac{\delta^2 k^2}{a^2}$ where,  $a:=\sqrt{\frac{1}{0.4262}}$. The corresponding PDF is given by
\begin{align}
    \lambda(z)=-\frac{\delta^2}{c}\frac{g'(\delta z)}{z}
    =a^2\frac{\exp(-k \delta z)(1-\exp(-k\delta z))}{z(1+\exp(-k\delta z))^3}
\end{align}
Observe that the temperature parameter $k$ comes from the surrogate to be simulated, while $\delta$ is used by \lzo. 
Appendix \ref{app:surr_to_dist} provides calculations and distribution corresponding to the popular Fast-sigmoid surrogate, followed by a description of the inverse sampling method that can be used to simulate sampling for arbitrary distributions using the uniform distribution.
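
The constant $c$ above can be traced with a change of variables. The following sketch is our own check, not the derivation from the appendix. Differentiating the Sigmoid surrogate gives $g'(u) = -\frac{k^2 \exp(-ku)(1-\exp(-ku))}{(1+\exp(-ku))^3}$, and substituting $s=k\delta t$ (so that $\frac{dt}{t}=\frac{ds}{s}$) yields
\begin{align*}
c = -2\delta^2\int_{0}^\infty \frac{g'(t\delta)}{t}\,dt = 2\delta^2k^2\int_{0}^\infty \frac{\exp(-s)(1-\exp(-s))}{s\,(1+\exp(-s))^3}\,ds .
\end{align*}
The remaining integral is a pure number that depends on neither $k$ nor $\delta$; the value quoted above corresponds to twice this integral being approximately $0.4262$, which appears to coincide with $\frac{7\zeta(3)}{2\pi^2}\approx 0.42628$. A consequence visible in the formula for $\lambda(z)$ is that the sampling distribution depends on $k$ and $\delta$ only through the product $k\delta$.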

\begin{figure*}
%\vskip 0.2in
\begin{center}
\centerline{
\includegraphics[width=0.32 \textwidth]{plots/gaussian_derivative-crop.pdf}
\includegraphics[width=0.32 \textwidth]{plots/uniform_derivative-crop.pdf}
\includegraphics[width=0.32 \textwidth]{plots/laplacian_derivative-crop.pdf}}
\caption{The figure shows the expected surrogates derived in Section \ref{sec:dist_to_sur} as $z$ is sampled from Normal$(0,1)$, Unif$([-\sqrt{3}, \sqrt{3}])$ and Laplace$(0, \frac{1}{\sqrt{2}})$ respectively. Each figure shows the surrogates corresponding to $\delta \rightarrow 0$, $\delta=0.5$ and $1$. The surrogates are supplied to \spgd methods for a fair comparison with \lzo as the latter uses respective distributions to sample $z$.}
\label{fig:lzo_distribution}
\end{center}
\vskip -0.3in
\end{figure*}

\subsection{Expected back-propagation threshold for \lzo}
\label{sec:expected_th}
To study the energy efficiency of the \lzo method when training SNNs, we compute the expected threshold $\tilde{B}_{th}$ for the activity of the neurons, i.e. the expectation of the quantity $\abs{z}\delta$ when $z$ is sampled from a distribution $\lambda$. It is used in the experimental section when comparing our method with the alternative energy-efficient method \cite{nieves2021sparse}. The expected threshold values are presented in Table \ref{tab:expected_th} ($m$ denotes the number of samples used in \eqref{eqn:lzo_sum}), while the details of the derivations can be found in Appendix \ref{app:threshold}.
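
As an illustration of how such values can arise, the sketch below assumes that with $m$ samples the effective threshold is $\delta\max_k\abs{z_k}$, since the averaged gradient in \eqref{eqn:lzo_sum} is non-zero exactly when $\abs{u}<\delta\abs{z_k}$ for at least one $k$. Writing $F$ for the CDF of $\abs{z_k}$, the expected maximum of $m$ independent samples is
\begin{align*}
\frac{\tilde{B}_{th}}{\delta} = \int_0^\infty \left(1-F(x)^m\right)dx .
\end{align*}
For the uniform distribution with $F(x)=\frac{x}{\sqrt{3}}$ on $[0,\sqrt{3}]$, this evaluates to $\sqrt{3}\,\frac{m}{m+1}$, that is, approximately $0.866$ for $m=1$ and $1.443$ for $m=5$. For the Laplace distribution with $F(x)=1-e^{-\sqrt{2}x}$, the maximum of $m$ exponential variables gives $\frac{1}{\sqrt{2}}\sum_{j=1}^{m}\frac{1}{j}$, that is, approximately $0.707$ for $m=1$ and $1.615$ for $m=5$. The derivation in the appendix is the reference; this computation only serves as a consistency check.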

\begin{wraptable}[9]{r}{0.5\textwidth}  %[11]{L}{0.5\textwidth }
\vspace{-6mm}
\begin{minipage}{0.5\textwidth}
 %\begin{table}
\caption{The expected back-propagation thresholds}
\label{tab:expected_th}
\begin{center}
\begin{footnotesize}
\begin{tabular}{cccc} 
\toprule
\multicolumn{2}{c}{$z \sim \lambda$ } & \multicolumn{2}{c}{$\tilde{B}_{th}/\delta $ }\\
\midrule
$\lambda$& $F_{\abs{z_k}}(x)$ & 	$m=1$ & $m=5$  \\
\midrule
Normal$(0,1)$ &$\erf(\frac{x}{\sqrt{2}})$ &$0.798$ & $1.569$ \\
Unif$([-\sqrt{3}, \sqrt{3}])$ & $\frac{x}{\sqrt{3}}$ &$0.866$ & $1.443$ \\
Laplace$(0, \frac{1}{\sqrt{2}})$& $1-e^{-\sqrt{2} x}$ &$0.707$ & $1.615$\\
\bottomrule
\end{tabular}
\end{footnotesize}
\end{center}
% \vskip -0.2in
 %\end{table}
\end{minipage}
\end{wraptable}




\section{Experiments} 
\subsection{General Performance of \lzo}
 First, we evaluate the generalization performance of \lzo as a substitute for the surrogate method on standard static image datasets such as CIFAR-10, CIFAR-100~\cite{krizhevsky2009learning}, and ImageNet-100~\cite{deng2009imagenet}, and on neuromorphic datasets such as DVS-CIFAR-10~\cite{li2017cifar10}, DVS-Gesture~\cite{amir2017low}, N-Caltech-101~\cite{orchard2015converting}, and N-CARS~\cite{sironi2018hats}. More specifically, by substituting surrogate gradients with gradients computed with our method, we combine \lzo with contemporary state-of-the-art methods in direct training of SNNs, such as tDBN \cite{zheng2021going} and TET \cite{deng2022temporal}. We opted for these two techniques due to their high performance and different natures (the former is a batch normalization technique, while the latter introduces an auxiliary loss in training). We refer to the combined methods as \lzotd and \lzott, respectively. Following recent results, we choose the ResNet-19~\cite{zheng2021going} architecture for the CIFAR datasets, SEW-ResNet-34~\cite{fang2021deep} for ImageNet-100, and the VGGSNN~\cite{deng2022temporal} architecture for the neuromorphic datasets. Table~\ref{tab:acc_comp} summarizes these results, where \lzo is implemented with $m=5, \delta=0.5$, except for DVS-CIFAR-10 where we have used $m=1$. Table \ref{tab:parameter_general} reports the detailed training hyper-parameters for each dataset.  

\textbf{Results on Static Datasets}: The static datasets CIFAR-10, CIFAR-100, and ImageNet-100 have the number of classes mentioned in the dataset name, while each class has (5000, 1000), (500, 100), and (1300, 50) train and test images, respectively. We use constant encoding to supply the images to the SNN, with standard latencies as mentioned in Table \ref{tab:acc_comp}. We train separate models for different latencies and report the corresponding test results. The results on the CIFAR datasets are reported with cutout augmentation following TET~\cite{deng2022temporal}, while for ImageNet-100, we use standard data augmentation (random resized crop, random horizontal flip), without and with the ImageNet Policy~\cite{cubuk2018autoaugment}. We note that in conjunction with \lzo, both tDBN and TET improve their performance significantly for all the latencies. For example, on CIFAR-10, \lzo improves TET by $0.9$--$1\%$ for different latencies, while on CIFAR-100, it improves TET by $2.5$--$3.5\%$. The ImageNet-100 results reported in Table \ref{tab:acc_comp} were further enhanced to 83.33\% with $m=20$ and batch size 72. 

\textbf{Results on Neuromorphic Datasets}: Events of the neuromorphic datasets are collected into event frames of dimension $(2 \times H \times W)$, where $H$ and $W$ stand for the height and width of the frame, and are resized to $(2 \times 48 \times 48)$. The temporal events are collected into a fixed number (10) of frames (a.k.a. bins), treated as the effective temporal dimension for the SNN. The experiments are reported without and with standard data augmentation. For DVS-CIFAR-10, the TET result is re-computed to avoid an obscure frame preparation step, which is replaced by an open-source routine. We obtain results superior to the state of the art for DVS-CIFAR-10~\cite{deng2022temporal} and DVS-Gesture~\cite{fang2021deep}. For N-Caltech-101 and N-CARS, the improvements are 1.2\% and 2.4\%, respectively, compared to the state of the art~\cite{gehrig2019end, schaefer2022aegnn}.


%For the augmented data with the random horizontal flip and random crop, we obtain better performance with \lzott compared to TET. 
%is a standard neuromorphic dataset where only 1000 images of CIFAR-10 per class are chosen for conversion to event data using a dynamic vision sensor. As input to the SNN model, the event data is accumulated in a few frames resized to $(48 \times 48)$, and the number of frames becomes the latency. The event stream consists of time-stamp, pixel location, and intensity change information.

\begin{table}
  \centering
  \caption{Comparison with the existing methods shows that \lzo improves the accuracy of existing direct training algorithms. For the existing methods, we compare against the results reported in the respective literature. For the rows with two accuracies reported, the second one is for training with additional augmentation.}
  \label{tab:acc_comp}
  \begin{footnotesize}
  \begin{tabular}{ccccc}
    \toprule
    Dataset & Methods & Architecture & Simulation Length & Accuracy \\
    \midrule
      & Hybrid training\cite{rathi2020enabling} & ResNet-20 & 250 & 92.22 \\
             & Diet-SNN\cite{rathi2020diet} & ResNet-20 & 10 & 92.54 \\
            & STBP\cite{wu2018spatio} & CIFARNet & 12 & 89.83 \\
             & STBP NeuNorm\cite{wu2019direct} & CIFARNet & 12 & 90.53 \\
             & TSSL-BP\cite{zhang2020temporal} & CIFARNet & 5 & 91.41 \\
             \cline{2-5}
             & & & 6 & 93.16 \\
                       & tDBN\cite{zheng2021going}             &     ResNet-19        &           4 & 92.92 \\
                                    &            &            & 2 & 92.34 \\
                                                \cline{2-5}
          CIFAR-10  &  &  & 6 & 95.07 \\
                       &  \textbf{\lzo+tDBN}   &  ResNet-19          & 4 & 94.89 \\
                       &     &            & 2 & 94.65 \\
           \cline{2-5}
              &  &  & 6 & 94.50 \\
                       & TET\cite{deng2022temporal}    &   ResNet-19  & 4 & 94.44 \\
                       &     &            & 2 & 94.16 \\
                                     \cline{2-5}
              &  &  & 6 & \textbf{95.56} \\
                       &  \textbf{\lzo+TET}   &   ResNet-19  & 4 & \textbf{95.3} \\
                       &     &            & 2 & \textbf{95.03} \\
           
    \midrule
     & Hybrid training\cite{rathi2020enabling} & VGG-11 & 125 & 67.87 \\
            & Diet-SNN\cite{rathi2020diet} & ResNet-20 & 5 & 64.07 \\
             \cline{2-5}
              &  &  & 6 & 71.12 \\
                                     &   tDBN\cite{zheng2021going}         & ResNet-19           & 4 & 70.86 \\
                                     &            &            & 2 & 69.41 \\
            \cline{2-5}
              &     &&            6 & 73.74 \\
               CIFAR100                       & \textbf{\lzo+tDBN}  &  ResNet-19   & 4 & 74.13 \\
                                     &            &            & 2 & 72.78 \\                        
            \cline{2-5}
              &   &  & 6 & 74.72 \\
                        & TET\cite{deng2022temporal}    &  ResNet-19 & 4 & 74.47 \\
                        &     &            & 2 & 72.87 \\
                          \cline{2-5}
              &  & & 6 & \textbf{77.25} \\
                        & \textbf{\lzo+TET}     &    ResNet-19 & 4 & \textbf{76.89} \\
                        &     &            & 2 & \textbf{76.36} \\          
           
    \midrule
    ImageNet-100 &EfficientLIF-Net\cite{kim2023sharing} &ResNet-19& 5 & 79.44\\
    &\textbf{\lzo+TET} & SEW-Resnet34 & 4 &  78.58, \textbf{81.56}$^\ddagger$\\
    \midrule
     & tdBN\cite{zheng2021going} & ResNet-19 & 10 & 67.8 \\
   & Streaming Rollout~\cite{kugele2020efficient} & DenseNet & 10 & 66.8 \\
  & Conv3D\cite{wu2021liaf} & LIAF-Net & 10 & 71.70 \\
   DVS-CIFAR10 & LIAF\cite{wu2021liaf} & LIAF-Net & 10 & 70.40 \\
  & TET\cite{deng2022temporal} & VGGSNN & 10 & 74.89$^\star$, 81.45$^\star$ \\ %77.33, 83.17 
  & \textbf{\lzo+tDBN} & VGGSNN & 10 &  72.6, 79.37\\
  & \textbf{\lzo+TET} & VGGSNN & 10 &  \textbf{75.62}, \textbf{81.87}\\
    \midrule
    & AEGNN\cite{schaefer2022aegnn}&GNN& - & 66.8\\
  N-Caltech-101   &EST\cite{gehrig2019end}&ResNet-34$^\dagger$ & 9 &81.7\\
    & \textbf{\lzo+tDBN} & VGGSNN & 10 &  74.65, 79.05\\
    & \textbf{\lzo+TET} & VGGSNN & 10 &  \textbf{79.86, 82.99}\\
    \midrule
    & AEGNN\cite{schaefer2022aegnn}&GNN& - & 94.5\\ 
   N-CARS  & EST\cite{gehrig2019end}&ResNet-34$^\dagger$& 9 & 92.5\\
   & \textbf{\lzo+tDBN} & VGGSNN & 10 &  95.96, 95.68\\
    & \textbf{\lzo+TET} & VGGSNN & 10 &  \textbf{96.78}, \textbf{96.96}\\
    \midrule
    DVS-Gesture & SEW\cite{fang2021deep} & SEW-Resnet & 16 & 97.92\\
      & \textbf{\lzo+TET} & VGGSNN & 10 &  98.04, \textbf{98.43}\\
 \bottomrule
 \multicolumn{5}{c}{$^\star$ our implementation, $^\dagger$pre-trained with ImageNet, $^\ddagger$ 83.33 \% with $m=20$}
 \end{tabular}
 \end{footnotesize}
 %\vspace{-3mm}
\end{table}

\subsection{Performance on Energy Efficient Implementation}
\begin{figure}
%\vskip 0.2in
\begin{center}
\centerline{\includegraphics[width=0.32\linewidth]{plots/loss_nmnist_sigmoid_1-crop.pdf}
\includegraphics[width=0.32\linewidth]{plots/speedup_overall_sigmoid_1-crop.pdf}
\includegraphics[width=0.32\linewidth]{plots/total_active-crop.pdf}}
\caption{We plot training loss, overall speedup, and percentage of active neurons for the Sigmoid surrogate, as reported in Table \ref{tab:dataset_SHD}. The \lzo algorithm converges faster than the \spgd method while having a similar overall speedup. The percentage of active neurons being less than 0.6\% explains the reduced computational requirement, which translates to backward speedup.}
\label{fig:sigmoid}
\end{center}
\vskip -0.3in
\end{figure}

In the energy-efficient implementation of back-propagation \cite{nieves2021sparse}, the network weights are optimized layer by layer by unrolling the recurrence of equation \eqref{eq:lif_discrete} with respect to time. Since the active neurons of each layer at every time step are identified in the forward pass, only the gradients of active neurons need to be saved for the backward pass, which reduces its computational cost. One may refer to \cite{nieves2021sparse} for further details of this implementation framework. For comparison, we supply the \spgd method with the surrogate approximated by \lzo, as per Section \ref{sec:dist_to_sur}. The \spgd algorithm also requires a back-propagation threshold parameter, $B_{th}$, to control the number of active neurons participating in the back-propagation. We set it to the expected back-propagation threshold $\tilde{B}_{th}$ of \lzo obtained in Section \ref{sec:expected_th}. We follow the same experimental setting as in \cite{nieves2021sparse} for a fair comparison. We use a fully connected LIF neural network with two hidden layers of 200 neurons each, in addition to the input and output layers. We train every model for 20 epochs and report the average training and test accuracies computed over five trials. We compute the speedup of \spgd and \lzo with respect to the full surrogate without truncation, which uses standard back-propagation. The backward speedup (Back.) is the factor by which the backward pass of a gradient update is faster, while the overall speedup (Over.) is the corresponding factor for the total time of the forward and backward passes. The reported speedup is averaged over all gradient updates and experimental trials.
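
To make the two speedup measures concrete, one possible way to write them is given below. The symbols $T_{f}$ and $T_{b}$ (forward and backward time of one gradient update under the full surrogate) and $T'_{f}$ and $T'_{b}$ (the same quantities under the sparse implementation) are notation introduced only for this illustration.
\[
\mathrm{Back.} = \frac{T_{b}}{T'_{b}}, \qquad \mathrm{Over.} = \frac{T_{f} + T_{b}}{T'_{f} + T'_{b}}.
\]
Under the simplifying assumption $T'_{f} \approx T_{f}$, and writing $r = T_{b}/T_{f}$, this gives
\[
\mathrm{Over.} \approx \frac{1 + r}{1 + r/\mathrm{Back.}} < 1 + r,
\]
so the overall speedup is limited by the share of the backward pass in the update time, however large the backward speedup becomes. As a rough estimate only, inserting the NMNIST values for \lzo with the Normal distribution from Table \ref{tab:dataset_SHD} ($\mathrm{Back.} = 92.27$, $\mathrm{Over.} = 3.34$) into this approximation yields $r \approx 2.43$, that is, an upper limit of about $3.43$ on the overall speedup in that setting.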

We compare the performance of the algorithms on three datasets: 1) Neuromorphic-MNIST (NMNIST) \cite{orchard2015converting}, which consists of static images of handwritten digits (between 0 and 9) converted to temporal spiking data using visual neuromorphic sensors; 2) Spiking Heidelberg Digits (SHD) \cite{cramer2020heidelberg}, a neuromorphic audio dataset consisting of spoken digits (between 0 and 9) in English and German, totalling 20 classes. To challenge the generalizability of the learning task, 81\% of the test inputs of this dataset are new voice samples that are not present in the training data; 3) Fashion-MNIST (FMNIST) \cite{xiao2017fashion}, whose static gray-scale images are converted to spike trains using temporal encoding, based on the principle that each input neuron spikes only once and a higher intensity results in an earlier spike.
\begin{wraptable}[30]{r}{0.6\textwidth}  %[11]{L}{0.5\textwidth }
%\vspace{-6mm}
\begin{minipage}{0.6\textwidth}
%\begin{table}[ht]
\caption{Performance on NMNIST, SHD and FMNIST}
\label{tab:dataset_SHD}
%\vspace{-3mm}
\begin{center}
\begin{footnotesize}
\begin{tabular}{lcccc}
\toprule
Method & Train & Test & Back. & Over. \\
\midrule
\textbf{NMNIST} & \multicolumn{3}{c}{$z \sim$ Normal$(0,1)$, $\delta=0.05, m=1$ }\\ %
\midrule
%\sur &95.25 $\pm$ 0.14& 93.70$\pm$ 0.10& 1 & 1 \\
\spgd &93.26 $\pm$ 0.31& 91.86$\pm$ 0.29& 99.57 & 3.38\\
\lzo   &94.38 $\pm$ 0.12& 93.29$\pm$ 0.08& 92.27 & 3.34\\
\midrule
& \multicolumn{3}{c}{ Sigmoid, $\delta=0.05, k \approx 30.63, m=1$ }\\ %
\midrule
\spgd &92.96$\pm$ 0.26& 91.04$\pm$ 0.32& 87.45 & 3.00\\
\lzo    &93.98$\pm$ 0.08& 92.97$\pm$ 0.05& 83.54 & 3.02\\
\midrule
\midrule
\textbf{SHD} & \multicolumn{3}{c}{ $z \sim$ Normal$(0,1)$, $\delta=0.05, m=1$ }\\ %
\midrule
%\surt  &94.58$\pm$ 0.31& 75.48$\pm$ 0.70& 1 & 1\\
\spgd &92.03$\pm$ 0.79& 74.73$\pm$ 0.73& 143.7 & 4.83\\
\lzo    &91.77$\pm$ 0.27& 76.55$\pm$ 0.93& 142.8 & 4.75\\
\midrule
&\multicolumn{3}{c}{Sigmoid, $\delta=0.05, k \approx 30.63, m=1$ }\\ %
\midrule
%\surt  &94.58$\pm$ 0.31& 75.48$\pm$ 0.70& 1 & 1\\
\spgd &92.19$\pm$ 0.41& 75.80$\pm$ 0.97& 140.8 & 4.46\\
\lzo    &91.96$\pm$ 0.11& 76.97$\pm$ 0.40& 133.6 & 4.36\\
\midrule
\midrule
\textbf{FMNIST} & \multicolumn{3}{c}{ $z \sim$ Normal$(0,1)$, $\delta=0.05, m=1$ }\\ %
\midrule
%\surt  &86.21$\pm$  0.05& 83.35$\pm$ 0.08& 1 & 1\\
\spgd &81.91$\pm$ 0.10& 80.28$\pm$ 0.11& 15.74 & 1.97\\
\lzo  &83.83$\pm$ 0.07& 81.79$\pm$ 0.06& 15.49 & 1.88\\
\midrule
&\multicolumn{3}{c}{Sigmoid, $\delta=0.05, k \approx 30.63, m=1$ }\\ %
\midrule
\spgd &81.60$\pm$ 0.11& 80.02$\pm$ 0.08& 12.12 & 1.65\\
\lzo  &83.39$\pm$ 0.10& 81.76$\pm$ 0.10& 12.50 & 1.57\\
\bottomrule
\end{tabular}
\end{footnotesize}
\end{center}
%\vskip -0.3in
%\end{table}
\end{minipage}
\end{wraptable}


%We also compute the active neurons at each layer as a percentage, normalizing by the batch size, number of neurons in the layer, and the latency. The normalization reflects the computation required by a non-sparse gradient so that the percentage of active neurons serves as the proxy of computational savings due to the energy-efficient implementation.
%The data labeled into ten classes simulates the neuronal inputs captured by the biological visual sensors.  
%The experiments are carried out on an NVIDIA RTX A6000 GPU, with Pytorch CUDA extension.
%Note that $\tilde{B}_{th}$ depends on the distribution $\lambda$, the number of samples $m$ used by \lzo, and the parameter $\delta$.


Table \ref{tab:dataset_SHD} provides a comparison of the algorithms, using surrogates corresponding to the Normal and Sigmoid, with $\delta = 0.05$ and $m=1$. For the normal distribution, we supply the \spgd algorithm with the back-propagation threshold $\tilde{B}_{th}$ obtained in Table \ref{tab:expected_th}. In Section \ref{sec:surr_to_dist}, we derived the distribution corresponding to the Sigmoid surrogate. We use inverse transform sampling (see \ref{ssec: inverse transform}) and take the temperature parameter $k=a/\delta\approx 30.63$ so that $c=\frac{\delta^2 k^2}{a^2}=1$, and supply the \spgd method with the corresponding back-propagation threshold, $\Tilde{B}_{th} = 0.766\delta$. The \lzo method offers better test accuracies than \spgd, with a slight compromise in speedup due to the sampling of the random variable $z$. The difference between training and test accuracies for the SHD dataset can be attributed to the unseen voice samples in the test data \cite{cramer2020heidelberg}.
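
As a quick check of the constants above, the following worked calculation uses only the reported values, without restating the definition of $a$ from Section \ref{sec:surr_to_dist}. With $\delta = 0.05$ and $k \approx 30.63$,
\begin{align*}
a &= k\delta \approx 30.63 \times 0.05 \approx 1.53, \\
c &= \frac{\delta^2 k^2}{a^2} = 1, \\
\Tilde{B}_{th} &= 0.766\,\delta = 0.766 \times 0.05 \approx 0.038 .
\end{align*}
Inverse transform sampling relies on a standard fact: if $U \sim \mathrm{Uniform}(0,1)$ and $F$ is a continuous, strictly increasing distribution function, then $z = F^{-1}(U)$ satisfies
\[
P(z \le x) = P(U \le F(x)) = F(x),
\]
so $z$ is distributed according to $F$. Here $F$ stands for the distribution function associated with the Sigmoid surrogate; the symbols $U$ and $F$ are introduced only for this illustration.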

\begin{wrapfigure}[13]{r}{0.4\textwidth}
 \vspace{-5mm}
  \begin{center}
    \includegraphics[width=0.38\textwidth]{plots/neuron_sparisty-crop.pdf}
  \end{center}
  \caption{Comparison of gradient sparsity in CNN architecture}
  \label{fig:neuron_sparsity}
  %\vspace{-5mm}
\end{wrapfigure}

Figure \ref{fig:sigmoid} shows the training loss, overall speedup, and percentage of active neurons after each gradient step for the Sigmoid surrogate. The sparseness of active neurons (under 0.6\%) explains the reduced computational requirement that translates to the speedup. 

We further implement \lzo with $\delta=0.5$, $z \sim$ Normal$(0,1)$ to train a CNN architecture (Input-16C5-BN-LIF-MP2-32C5-BN-LIF-MP2-800FC-10) and compare it with the corresponding surrogate gradient algorithm. Figure \ref{fig:neuron_sparsity} shows the sparsity of the two methods by plotting the number of zero elements at a neuronal level. The plot suggests that, during training, \lzo yields sparser gradients than the surrogate method.
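
The architecture string can be sanity-checked with a short shape calculation. This is a sketch under assumptions that are not stated in the text: the convolutions (16C5, 32C5) use $5 \times 5$ kernels with stride 1 and no padding, MP2 is $2 \times 2$ max pooling with stride 2 that rounds down, 800FC is read as a fully connected layer with 800 input features, and the input has spatial size $34 \times 34$ (as in NMNIST).
\[
34 \xrightarrow{\;16\mathrm{C}5\;} 30 \xrightarrow{\;\mathrm{MP}2\;} 15 \xrightarrow{\;32\mathrm{C}5\;} 11 \xrightarrow{\;\mathrm{MP}2\;} 5,
\qquad 32 \times 5 \times 5 = 800 .
\]
A $32 \times 32$ input gives the same count under these assumptions ($32 \to 28 \to 14 \to 10 \to 5$), so the calculation does not by itself determine the dataset.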


\textbf{Ablation study:} Table \ref{tab:tradeoff} further shows the test accuracy and overall speedup of the \lzo method for a wide range of values of $m$ with $z \sim$ Normal$(0,1)$. As in Table \ref{tab:dataset_SHD}, the experiments are repeated five times, and the mean test accuracy is reported along with its standard deviation (Std.). In general, increasing $m$ makes the method approximate the surrogate better while retaining the regularizing effect, and potentially improves generalization, but it also requires more computation: a larger $m$ leads to more non-zero gradients at the neuronal level in the backward pass, reducing the overall speedup. On the other hand, a smaller $m$ introduces higher (less ``controlled'') randomness, which still yields regularization that helps generalization, along with a potentially larger speedup. In conclusion, $m$ should be treated as a hyper-parameter whose value depends on the training setting. We chose $m = 1$ or $5$ for most of our experiments, as a proof of concept and because these values offer a good balance between speedup and performance.
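
The effect of $m$ on gradient sparsity can be illustrated with a short calculation. The sketch assumes a two-point zeroth-order estimate of the Heaviside derivative averaged over $m$ independent samples $z_1, \dots, z_m$, with $u$ denoting the membrane potential minus the threshold and $h$ the Heaviside function; this notation is chosen for the illustration and may differ from the definitions in the earlier sections. Up to the boundary case $|u| = \delta |z_k|$,
\[
\frac{1}{m}\sum_{k=1}^{m} \frac{h(u + \delta z_k) - h(u - \delta z_k)}{2\delta}\, z_k
= \frac{1}{m}\sum_{k=1}^{m} \frac{|z_k|}{2\delta}\,\mathbf{1}\{|u| < \delta |z_k|\}.
\]
The neuron therefore receives a non-zero gradient as soon as one sample satisfies $|z_k| > |u|/\delta$. Writing $p(u) = P(|z| > |u|/\delta)$ for a single sample, the probability of a non-zero gradient is
\[
1 - \bigl(1 - p(u)\bigr)^{m},
\]
which increases with $m$ whenever $0 < p(u) < 1$. Under these assumptions, a larger $m$ activates more neurons in the backward pass, in line with the reduced speedup.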

\begin{table}[ht]
\caption{Trade-off of accuracy vs. speedup with hyper-parameter $m$}
\label{tab:tradeoff}
\centering
\begin{footnotesize}
\begin{tabular}{lccccccc} 
\toprule
 \textbf{m}& 1 & 3 & 5 & 7 & 10 & 20 & 100 \\
 \midrule
 \multicolumn{8}{c}{\textbf{NMNIST} }\\
 \midrule
Accuracy& 93.29 & 93.61 & 93.69 & 93.66 & 93.76 & 93.67 & 93.81 \\ 
 Std.& 0.08 & 0.15 & 0.17 & 0.13 & 0.14 & 0.08 & 0.14 \\ 
 Over.& 3.33 & 3.28 & 3.22 & 3.16 & 3.06 & 2.82 & 1.59 \\
 \midrule
 \multicolumn{8}{c}{\textbf{SHD} }\\
 \midrule
Accuracy & 76.55 & 76.55 & 76.50 & 75.49 & 75.51 & 74.96 & 76.71 \\
 Std.& 0.93 & 0.65 & 0.90 & 0.66 & 0.81 & 0.68 & 0.49 \\
 Over.& 4.75 & 4.62 & 4.47 & 4.39 & 4.25 & 3.89 & 2.24 \\
 \midrule
 \multicolumn{8}{c}{\textbf{FMNIST} }\\
 \midrule
Accuracy & 81.79 & 83.40 & 83.64 & 83.70 & 83.85 & 83.75 & 83.87 \\
Std. &0.06	&0.06	&0.12	&0.04	&0.11	&0.11	&0.05\\
Over.&1.89	&1.85&	1.78&	1.75&	1.70&	1.56	&0.88\\
\bottomrule
\end{tabular}
\end{footnotesize}
\end{table}
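
The trade-off can be read off Table \ref{tab:tradeoff} with plain arithmetic on the reported means (accuracy gain in percentage points, followed by the fraction of overall speedup retained):
\[
\mathrm{NMNIST},\ m = 1 \to 100:\quad 93.81 - 93.29 = 0.52, \qquad 1.59/3.33 \approx 0.48,
\]
\[
\mathrm{FMNIST},\ m = 1 \to 3:\quad 83.40 - 81.79 = 1.61, \qquad 1.85/1.89 \approx 0.98 .
\]
On NMNIST, moving to $m=100$ gains about half a point at roughly half the overall speedup, whereas on FMNIST moving from $m=1$ to $m=3$ gains about 1.6 points while retaining about 98\% of the speedup. For FMNIST with $m=100$, the overall speedup of $0.88$ is below one, meaning that training is slower than with the full surrogate baseline.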

\section{Discussions} We propose a new direct training algorithm for SNNs that establishes a formal connection between the standard surrogate methods and the zeroth-order method applied locally to the neurons. The method introduces systematic randomness into training, which helps generalization. It also lends itself to efficient back-propagation. We experimentally demonstrate the efficiency of the proposed method in terms of the training speedup obtained under specialized implementations, and its strong generalization performance when combined with other training methods, complementing their respective strengths.


\section*{Acknowledgement} This work is part of the research project ``ENERGY-BASED PROBING FOR SPIKING NEURAL NETWORKS'' performed at Mohamed bin Zayed University of Artificial Intelligence~(MBZUAI), in collaboration with Technology Innovation Institute~(TII) (Contract No. TII/ARRC/2073/2021).