# N Multipliers for N Bits: Learning Bit Multipliers for Non-Uniform Quantization

### **Anonymous Author(s)**

Affiliation Address email

## **Abstract**

Effective resource management is critical for deploying Deep Neural Networks (DNNs) in resource-constrained environments, highlighting the importance of low-bit quantization to optimize memory and speed. In this paper, we introduce N-Multipliers-for-N-Bits, a novel method for non-linear quantization designed for efficient hardware implementation. Our method uses N parameters, distinct for every layer and corresponding to the N quantization bits, whose linear combinations span the set of allowed weights (and activations). Furthermore, we learn these parameters in parallel with the weights, ensuring exceptional flexibility in the quantizer model with minimal hardware overhead. We validate our method on CIFAR10 and ImageNet, achieving competitive results with 3- and 4-bit quantized models. We demonstrate strong performance on 4-bit quantized Spiking Neural Networks (SNNs), evaluated on the CIFAR10-DVS and N-Caltech 101 datasets. Further, we address the issue of stuck-at faults in hardware, and demonstrate robustness to up to 30% faulty bits.

# 1 Introduction

Deep learning dominates computer vision and broader AI applications, with cloud-based models performing inference by transferring data to servers. While effective, this approach is inefficient in terms of data transfer and power consumption. A more efficient alternative, especially for simple tasks, is edge inference using low-power accelerators with fixed-point arithmetic and in-memory or near-memory computing architectures [1, 2, 3, 4, 5]. These architectures, such as crossbar arrays, perform matrix-vector multiplication by accumulating parallel operations. They can be implemented using analog components or digital ones, but both approaches encounter a trade-off between energy efficiency and performance [6, 7]. Quantization, while improving efficiency, often degrades accuracy and is further impacted by hardware faults such as stuck-at (SA) faults [8] where certain weight bits get stuck at either 0 or 1 and become unprogrammable. Addressing both quantization errors and hardware faults is crucial for optimizing edge inference. Low-bit quantization for weights and activations has been explored extensively through quantization-aware training (QAT) and methods such as uniform and non-uniform quantization [9, 10, 11, 12]. Non-uniform methods, such as learning quantization levels or companding functions [13, 14], offer flexibility by learning key parameters. In this work, we propose a QAT scheme that optimizes bit multipliers for each quantization level, balancing performance and hardware efficiency. Unlike prior approaches, our method provides maximum flexibility during learning while still being hardware-friendly, and avoids the inaccuracies introduced by using gradient estimators.

Spiking Neural Networks (SNNs) and neuromorphic hardware present a promising solution to the challenge of energy-efficient edge inference. SNNs mimic the behavior of biological neurons by processing information through discrete spikes, making them inherently event-driven and power-efficient [15, 16]. Neuromorphic systems, such as TrueNorth and Loihi [17, 18], are designed to leverage the sparse, asynchronous nature of SNNs, enabling real-time inference with significantly lower power consumption compared to conventional Artificial Neural Network (ANN) accelerators. When combined with low-bit quantization, SNNs offer further energy reductions without sacrificing performance, especially on temporal datasets. We attain excellent performance on 4-bit SNNs, which can enable extremely low-power inference when ported to neuromorphic hardware.

Energy efficiency in edge devices often comes at the cost of circuit non-idealities such as line resistance and device variability [19, 20, 21], with SA faults introducing more significant challenges. Existing solutions [22, 8] attempt to handle SA faults via variable encoding or fault-aware training. We extend fault-aware training by incorporating faulty weights into QAT, modifying the regularization loss to prevent invalid weight configurations caused by SA faults. Our approach enables robust training for low-bit quantized models even with a high rate of hardware faults.

Our key contributions are summarized as follows:

- We introduce a novel, flexible, and hardware-compatible quantization framework that learns N bit multipliers per layer alongside network weights, enabling adaptable precision with minimal hardware overhead, while spanning a rich set of quantization levels.
- We show our method's effectiveness across multiple networks and datasets, achieving results comparable to the state of the art for 3- and 4-bit DNNs on CIFAR10 [23] and ImageNet [24], and for 4-bit SNNs on the event-based datasets CIFAR10-DVS [25] and N-Caltech 101 [26].
- We propose a fault-tolerant quantization method that enables low-bit models to maintain performance with up to 30% faulty bits, as demonstrated on CIFAR10.
- We propose a custom implementation of bit-level multipliers for analog/digital crossbars, optimized for our quantization scheme and directly portable to neuromorphic hardware.

# 2 Methodology

Preliminaries: Quantization aims to replace floating-point weights and activations in DNNs with low-bit representations to reduce memory usage and speed up computations. A general N-bit quantizer function has $2^N$ levels, $l_1, l_2, \ldots, l_{2^N}$, and $2^N - 1$ transition thresholds, $t_1, t_2, \ldots, t_{2^N-1}$, and is defined as follows:

$$Q(x) = \begin{cases} l_1 & \text{if } x < t_1 \\ l_i & \text{if } t_{i-1} \le x < t_i, \quad i = 2, 3, \dots, 2^N - 1 \\ l_{2^N} & \text{if } x \ge t_{2^N - 1} \end{cases} \tag{1}$$
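As a minimal sketch of Equation 1, the piecewise definition can be written as a sorted-threshold lookup. The function name `quantize_general` and the 2-bit `levels` and `thresholds` values below are hypothetical, chosen only for illustration; both lists are assumed to be sorted in ascending order.

```python
from bisect import bisect_right

def quantize_general(x, levels, thresholds):
    """Equation 1: return the level whose interval contains x.

    levels holds 2**N ascending values, thresholds holds 2**N - 1.
    """
    if len(thresholds) != len(levels) - 1:
        raise ValueError("need exactly one threshold between adjacent levels")
    # bisect_right counts the thresholds t with t <= x. That count is the
    # 0-based index of the level for the interval [t_{i-1}, t_i).
    return levels[bisect_right(thresholds, x)]

# Hypothetical 2-bit quantizer: 4 levels, 3 thresholds.
levels = [-1.0, -0.25, 0.5, 2.0]
thresholds = [-0.6, 0.1, 1.2]

print(quantize_general(-5.0, levels, thresholds))  # -1.0 (below t_1)
print(quantize_general(0.1, levels, thresholds))   # 0.5  (x equal to t_2 maps upward)
print(quantize_general(9.0, levels, thresholds))   # 2.0  (at or above the last threshold)
```

A value that sits exactly on a threshold is assigned to the upper level, matching the half-open intervals in Equation 1.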

Quantizer Model: We introduce an N-dimensional learnable vector  $r \in \mathbb{R}^N$ , which defines the N bit multipliers, alongside a scalar offset c in our quantizer model. The set of allowed quantized weights or activations is given by:

$$W_r = \left\{ \langle r, b \rangle + c \mid b \in \{0, 1\}^N \right\} \tag{2}$$

The quantization function maps each full-precision weight to its nearest quantized counterpart:

$$\hat{x} = Q(x, r) = \arg\min_{w_q \in W_r} |x - w_q| \tag{3}$$
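The set in Equation 2 and the nearest-level map in Equation 3 can be sketched by brute-force enumeration of all bit patterns, which is feasible for the small N considered here. The names `allowed_levels` and `quantize`, and the values of `r` and `c`, are illustrative assumptions rather than the authors' implementation.

```python
from itertools import product

def allowed_levels(r, c):
    """Equation 2: every <r, b> + c for b in {0,1}^N, as (level, bits) pairs."""
    return [(sum(rk * bk for rk, bk in zip(r, b)) + c, b)
            for b in product((0, 1), repeat=len(r))]

def quantize(x, r, c):
    """Equation 3: the level in W_r nearest to x, with its bit pattern."""
    return min(allowed_levels(r, c), key=lambda lv: abs(x - lv[0]))

r = [0.125, 0.25, 0.75]  # hypothetical 3-bit multipliers
c = -0.5                 # hypothetical offset

print(sorted(level for level, _ in allowed_levels(r, c)))
# [-0.5, -0.375, -0.25, -0.125, 0.25, 0.375, 0.5, 0.625]
print(quantize(0.3, r, c))  # (0.25, (0, 0, 1))
print(quantize(0.0, r, c))  # (-0.125, (1, 1, 0))
```

The levels are not evenly spaced (note the gap between -0.125 and 0.25), yet each one is still addressed by an ordinary 3-bit code. Choosing `r = [1, 2, 4]` and `c = 0` recovers the uniform levels 0 to 7.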

This design enables a flexible non-uniform quantizer with multiple step sizes, offering hardware efficiency while preserving the structure of N-bit quantization. Although learning all $2^N$ quantization levels would offer maximum flexibility, it would undermine hardware efficiency and the core benefits of N-bit quantization. Figure 1a illustrates a sample quantizer function. Drawing parallels between a general N-bit quantizer and the one introduced above, we can see that the elements of the set $W_r$, sorted in ascending order, serve as the levels $l_1, l_2, \ldots, l_{2^N}$, and the transition thresholds are defined as $t_i = (l_i + l_{i+1})/2$.

Loss and Learning: We jointly optimize the bit multipliers, offsets, and weights by introducing an additional quantization-aware loss alongside the standard cross-entropy loss. This allows the model parameters to be optimized through backpropagation within the usual training pipeline. During training, the weights remain in full precision but gradually align with their quantized counterparts due to the influence of the quantization-aware loss. The actual quantization is applied post-training, where the full-precision weights are mapped to their nearest quantized values.

**Quantization-Aware Loss:** We define a regularization loss that minimizes the squared error between each weight and its nearest quantized value. To balance gradient contributions across layers, we introduce a layer-specific scaling factor. The total loss is formulated as:

$$\mathcal{L} = \mathcal{L}_{CE} + \lambda \sum_{l=1}^{L} \alpha_l \sum_{i=1}^{n_l} \min_{w_q \in W_r^l} ||w_i - w_q||^2 \tag{4}$$

where $\mathcal{L}_{CE}$ is the cross-entropy loss and $W_r^l$ represents the set of quantized weights for layer l, defined by parameters $r^l$ and $c^l$. The term $\alpha_l$ is a layer-wise scaling factor, and $\lambda$ controls the regularization strength. Following prior work [10], we set $\alpha_l$ to $1/\sqrt{N\cdot Q_P}$, where $Q_P$ is $2^b-1$ for activations (unsigned data) and $2^{b-1}-1$ for weights (signed data); b denotes the number of bits. Figure 1b illustrates the regularization loss for a sample weight using an arbitrary vector r to define the quantized weight set. Equivalently, the loss can be expressed as a function of the weights and bit multipliers, as in Equation 5. This formulation jointly optimizes the overall objective and the quantization parameters, including the bit multipliers and offsets that define the quantization function itself.

$$\mathcal{L} = \mathcal{L}_{CE} + \lambda \sum_{l=1}^{L} \alpha_l \sum_{i=1}^{n_l} |w_i - Q(w_i, r^l)|^2 \tag{5}$$
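The per-layer regularization term shared by Equations 4 and 5 can be sketched as follows. This is an illustration under stated assumptions: `levels` and `quant_regularizer` are hypothetical names, the values of `r` and `c` are made up, and `alpha` and `lam` are plain arguments standing in for $\alpha_l$ and $\lambda$ (their defaults of 1.0 are not the paper's settings).

```python
from itertools import product

def levels(r, c):
    """All values <r, b> + c for b in {0,1}^N (the set W_r)."""
    return [sum(rk * bk for rk, bk in zip(r, b)) + c
            for b in product((0, 1), repeat=len(r))]

def quant_regularizer(weights, r, c, alpha=1.0, lam=1.0):
    """One layer's term: lam * alpha * sum_i min_q (w_i - w_q)^2."""
    lv = levels(r, c)
    return lam * alpha * sum(min((w - q) ** 2 for q in lv) for w in weights)

r = [0.125, 0.25, 0.75]  # hypothetical bit multipliers
c = -0.5                 # hypothetical offset

# Weights that already sit on allowed levels contribute nothing.
print(quant_regularizer([0.25, -0.5, 0.625], r, c))  # 0.0
# Off-level weights are penalized by their squared distance to the nearest level.
print(quant_regularizer([0.25, 0.3, 0.0], r, c))
```

Because the penalty vanishes only when every weight coincides with a level, minimizing it pulls weights toward levels and, through `r` and `c`, levels toward weights.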

Figure 1: Our quantizer model is non-uniform and learnable. The quantization-aware loss forces weights towards their allowed quantized levels, and the levels towards the weights. The weights are kept in FP during training, and are quantized to their closest allowed level during inference.

Gradient Calculation: The gradient calculation for the weights and quantizer parameters is straightforward. Since we use full-precision weights throughout training, we can simply define $\frac{\partial Q(w,r)}{\partial w}=0$, thereby eliminating the need for any gradient approximation techniques. For the quantizer parameters, $\frac{\partial Q(w,r)}{\partial c}=1$ and $\nabla_r Q(w,r)=B_r(Q(w,r))$, where $B_r:W_r\to\{0,1\}^N$ is an inverse map providing the bit representation vector of a quantized weight. This encoding function satisfies $w_q=\langle r,B_r(w_q)\rangle+c$ for all $w_q\in W_r$. The gradients for the weights, bit multipliers, and offsets are calculated as follows:

$$\frac{\partial L}{\partial w} = \frac{\partial L_{CE}}{\partial w} + 2\lambda \cdot \alpha_l \cdot (w - Q(w, r^l)) \tag{6}$$

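$$\frac{\partial L}{\partial r^l} = 2\lambda \cdot \alpha_l \sum_{i=1}^{n_l} (w_i - Q(w_i, r^l)) \cdot \left(-\nabla_{r^l} Q(w_i, r^l)\right) = 2\lambda \cdot \alpha_l \sum_{i=1}^{n_l} (Q(w_i, r^l) - w_i) \cdot B_r(Q(w_i, r^l)) \tag{7}$$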

$$\frac{\partial L}{\partial c^l} = 2\lambda \cdot \alpha_l \sum_{i=1}^{n_l} (w_i - Q(w_i, r^l)) \cdot \left(-\frac{\partial Q(w_i, r^l)}{\partial c^l}\right) = 2\lambda \cdot \alpha_l \sum_{i=1}^{n_l} (Q(w_i, r^l) - w_i) \tag{8}$$
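The regularizer gradients can be checked numerically. The sketch below differentiates the squared-error term of Equation 5 by the chain rule, using $\partial Q/\partial w = 0$, $\nabla_r Q = B_r(Q)$ and $\partial Q/\partial c = 1$ as stated above, with the common factor $\lambda \alpha_l$ omitted. The function names and the sample values are assumptions made for illustration.

```python
from itertools import product

def quantize(x, r, c):
    """Nearest level <r, b> + c and its bit vector b (Equation 3)."""
    return min(((sum(rk * bk for rk, bk in zip(r, b)) + c, b)
                for b in product((0, 1), repeat=len(r))),
               key=lambda lv: abs(x - lv[0]))

def reg(weights, r, c):
    """Regularizer sum_i (w_i - Q(w_i))^2, without the lambda * alpha factor."""
    return sum((w - quantize(w, r, c)[0]) ** 2 for w in weights)

def reg_grads(weights, r, c):
    """Chain-rule gradients of reg w.r.t. each weight, r, and c."""
    g_w, g_r, g_c = [], [0.0] * len(r), 0.0
    for w in weights:
        q, b = quantize(w, r, c)
        g_w.append(2 * (w - q))         # dQ/dw is defined as 0
        for k, bk in enumerate(b):
            g_r[k] += 2 * (q - w) * bk  # dQ/dr_k = b_k
        g_c += 2 * (q - w)              # dQ/dc = 1
    return g_w, g_r, g_c

weights = [0.3, 0.0, -0.45]  # hypothetical full-precision weights
r, c = [0.125, 0.25, 0.75], -0.5
g_w, g_r, g_c = reg_grads(weights, r, c)

# Central finite difference on r[2] as an independent check.
eps = 1e-6
r_hi, r_lo = r[:2] + [r[2] + eps], r[:2] + [r[2] - eps]
fd = (reg(weights, r_hi, c) - reg(weights, r_lo, c)) / (2 * eps)
print(g_r[2], fd)  # the two values agree closely
```

The finite-difference check is valid only when no weight sits on a transition threshold, since the nearest-level assignment must stay fixed under the small perturbation.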

**SNN Training:** SNNs inherently produce *quantized activations* in the form of *spike trains*, so we only need to quantize the weights of the network. We use a Leaky Integrate-and-Fire (LIF) model [27] for the spiking neuron in our SNN models. These discrete-time equations describe its dynamics:

$$H[t] = V[t-1] + \beta(X[t] - (V[t-1] - V_{reset})) \tag{9}$$

$$S[t] = \Theta(H[t] - V_{th}) \tag{10}$$

$$V[t] = H[t] (1 - S[t]) + V_{reset} S[t] \tag{11}$$

where X[t] denotes the input current at time step t. H[t] denotes the membrane potential following neural dynamics and V[t] denotes the membrane potential after a spike at step t, respectively. The model uses a firing threshold  $V_{th}$  and utilizes the Heaviside step function  $\Theta(x)$  to determine spike generation. The output spike at step t is denoted by S[t], while  $V_{reset}$  represents the reset potential following a spike. The membrane decay constant is denoted by  $\beta$ . To facilitate error backpropagation, we use the surrogate gradient method [28], defining  $\Theta'(x) \triangleq \sigma'(x)$ , where  $\sigma(x)$  is the arctan surrogate function [29]. The remaining part of the training and quantization follows that of the non-spiking networks described earlier.
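To make the forward dynamics of Equations 9 to 11 concrete, the sketch below steps a single LIF neuron through an input sequence. It covers the forward pass only (no surrogate gradient). The function name `lif_run` is hypothetical, the threshold `v_th = 1.0` is an assumed value not given in the text, and $\Theta(0) = 1$ is an assumed convention; `beta = 0.25` and `v_reset = 0.0` follow the settings reported in the experiments.

```python
def lif_run(inputs, beta=0.25, v_th=1.0, v_reset=0.0):
    """Simulate Equations 9-11 for one neuron and return its spike train."""
    v, spikes = v_reset, []
    for x in inputs:
        h = v + beta * (x - (v - v_reset))  # Eq. 9: leaky integration
        s = 1 if h >= v_th else 0           # Eq. 10: Heaviside, Theta(0) = 1 assumed
        v = h * (1 - s) + v_reset * s       # Eq. 11: hard reset after a spike
        spikes.append(s)
    return spikes

# Constant input current of 2.0 for six time steps.
print(lif_run([2.0] * 6))   # [0, 0, 1, 0, 0, 1]
# A weak input never drives the membrane potential to threshold.
print(lif_run([0.5] * 10))  # [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```

With the constant input of 2.0 the membrane potential reaches 0.5, then 0.875, then 1.15625, which crosses the assumed threshold and resets, so the neuron fires every third step.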

**Fault-Aware Modification:** We propose a two-pronged approach to address SA faults in quantized neural networks. Firstly, we enhance fault awareness during training by periodically (every 4 epochs) loading faulty weights onto the model. Secondly, we introduce a fault-aware modification to our algorithm, designed to avoid weight configurations rendered impossible by SA faults. We introduce a *validity* term that constrains weights to only those quantization levels that are achievable, avoiding those rendered unreachable by faulty bits. The *validity* term is defined for each layer as a binary map that indicates whether a specific weight can attain a given quantization level (1 if achievable, 0 otherwise). This allows us to modify the quantization-aware training loss in Equation 4 as follows:

$$\mathcal{L} = \mathcal{L}_{CE} + \lambda \sum_{l=1}^{L} \alpha_l \sum_{i=1}^{n_l} \min_{w_q \in W_r^l} \left( val_{i,q}^l \, |w_i - w_q|^2 + (1 - val_{i,q}^l) \cdot \Delta \right) \tag{12}$$

Here,  $val_{i,q}^l$  represents the validity term for weight  $w_i$  in layer l with respect to the quantization level  $w_q \in W_r^l$ . If  $w_i$  can reach  $w_q$ , then  $val_{i,q}^l = 1$ ; otherwise,  $val_{i,q}^l = 0$ . The term  $\Delta$  is a large constant that penalizes unreachable quantization levels, effectively excluding them from the optimization.
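The inner minimization of Equation 12 can be sketched for a single weight as below. The representation of faults as a dictionary from bit index to stuck value, the name `fault_aware_term`, and the value of `delta` are all assumptions made for this illustration; the validity term is computed by checking whether a level's bit pattern agrees with every stuck bit.

```python
from itertools import product

def fault_aware_term(w, r, c, stuck, delta=1e6):
    """Inner min of Equation 12 for one weight.

    stuck maps a bit index to the value (0 or 1) that bit is stuck at.
    Returns (cost, level, bits) for the best reachable level.
    """
    best = None
    for b in product((0, 1), repeat=len(r)):
        level = sum(rk * bk for rk, bk in zip(r, b)) + c
        # Validity: 1 only if this bit pattern agrees with every stuck bit.
        val = int(all(b[k] == v for k, v in stuck.items()))
        cost = val * (w - level) ** 2 + (1 - val) * delta
        if best is None or cost < best[0]:
            best = (cost, level, b)
    return best

r, c = [0.125, 0.25, 0.75], -0.5  # hypothetical quantizer parameters

# Fault-free: the weight is pulled towards its nearest level, 0.25.
print(fault_aware_term(0.3, r, c, stuck={})[1:])      # (0.25, (0, 0, 1))
# Bit 2 stuck at 0: level 0.25 is unreachable, so the target becomes -0.125.
print(fault_aware_term(0.3, r, c, stuck={2: 0})[1:])  # (-0.125, (1, 1, 0))
```

Since at least one bit pattern always agrees with the stuck bits, the minimum never selects the `delta` branch as long as `delta` exceeds every reachable squared error.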

# 3 Experiments

We initialize quantized networks with weights from a trained full-precision model of the same architecture, then fine-tune in the quantized space, which has been shown to improve performance [30, 31, 32]. We quantize input activations and weights to 3 or 4 bits for all matrix multiplication layers except the first and last, which use 8 bits. This approach is commonly used for quantizing deep networks, and has been shown to increase effectiveness at the cost of minimal overhead [10]. The weights and the quantization parameters (bit multipliers and offset values) are trained using SGD with a momentum of 0.9 and a cosine learning rate decay schedule [33]. We swept over different values of the regularization hyperparameter $\lambda$ and chose $\lambda = 100$ for our results.
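For reference, a cosine learning rate decay can be sketched as below. This is the single-cycle form without the warm restarts that [33] also describes; whether restarts were used is not stated, so that choice is an assumption, as are the function name `cosine_lr` and the `min_lr` parameter.

```python
import math

def cosine_lr(step, total_steps, base_lr, min_lr=0.0):
    """Cosine decay from base_lr at step 0 to min_lr at total_steps (no restarts)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Example: a 200-epoch run starting from a learning rate of 0.01.
for epoch in (0, 50, 100, 150, 200):
    print(epoch, round(cosine_lr(epoch, 200, 0.01), 6))
```

The rate falls slowly at first, fastest at the midpoint (where it equals half the base rate), and flattens out again as it approaches `min_lr`.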

ANN Training Details. We use the ResNet-18 [34] architecture for experiments on the CIFAR10 [23] and ImageNet [24] datasets. Models are trained for 200 epochs on CIFAR10 and 90 epochs on ImageNet, with weight learning rates of 0.01 and 0.1, respectively. The other parameters are trained with a learning rate of 0.001. For ImageNet, we preprocess images by resizing them to $256 \times 256$ pixels. During training, we apply random $224 \times 224$ crops and horizontal flips half the time. At inference, we use a center crop of $224 \times 224$. For CIFAR10, we augment the training data by padding images with 4 pixels on each side, then taking random $32 \times 32$ crops. We also apply random horizontal flips half the time. The results are shown in Tables 1 and 2.

SNN Training Details. We use the ResNet-19 [35] and VGG-11 [36] models, after adapting them to SNNs. Specifically, we replace all ReLU activation functions with LIF modules and substitute max-pooling layers with average pooling operations. We follow the implementation and data augmentation technique used in NDA [37] as our baseline training method. The weights and the other parameters are trained with learning rates of 0.01 and 0.001, respectively. We evaluate on the N-Caltech 101 and CIFAR10-DVS benchmarks. N-Caltech 101 consists of 8,831 DVS images converted from the original Caltech 101 dataset, while CIFAR10-DVS comprises 10,000 DVS images derived from the original CIFAR10 dataset. For both datasets, we apply a 9:1 train-validation split and resize all images to $48 \times 48$. Each sample is temporally integrated into 10 frames using spikingjelly [38]. $V_{reset}$ is set to 0 and the membrane decay $\beta$ is 0.25. Our results are presented in Table 3.

Fault-Aware Training. We evaluate our method on the VGG-13 architecture, training with 3-bit and 4-bit precision for both weights and activations on the CIFAR10 dataset. Our experiments consider varying levels of SA fault density. Figures 2a and 2b illustrate the efficacy of our approach for 4-bit and 3-bit quantization, respectively.

# 4 Results and Analysis

Comparison with Baselines. Tables 1 and 2 present our quantized ANN results for CIFAR10 and ImageNet, respectively. On CIFAR10, our method achieves the best accuracy relative to full precision (FP) at both bit-widths: the 4-bit ResNet-18 exceeds its FP baseline by 0.24%, and the 3-bit model loses only 0.42%. On ImageNet, the 4-bit ResNet-18 matches FP performance, the second-best relative result after LCQ [14]. For 4-bit quantized SNNs (Table 3), we observe performance gains on N-Caltech 101 and marginal losses on CIFAR10-DVS compared to FP. We attribute the occasional performance improvements in both 4-bit ANNs and SNNs to the regularization effect of our quantization loss.

Table 1: Accuracy (%) for 3- and 4-bit quantized ResNet-18 models on CIFAR10. FP denotes full-precision accuracy, $\Delta$ FP denotes difference in performance compared to the corresponding FP network. **Best**/second best relative performances for each bit-width are marked in **bold**/underlined.

| Method       | FP    | W4/A4 ($\Delta$ FP) | W3/A3 ($\Delta$ FP) |
|--------------|-------|---------------------|---------------------|
| L1 Reg [39]  | 93.54 | 89.98 (-3.56)       | -                   |
| BASQ [40]    | 91.7  | 90.21 (-1.49)       | -                   |
| LTS [41]     | 91.56 | 91.7 (+0.1)         | 90.58 (-0.98)       |
| PACT [9]     | 91.7  | 91.3 (<u>-0.4</u>)  | 91.1 (-0.6)         |
| LQ-Nets [13] | 92.1  | -                   | 91.6 (-0.5)         |
| LCQ [14]     | 93.4  | 93.2 (-0.2)         | 92.8 (<u>-0.6</u>)  |
| Ours         | 93.26 | 93.50 (**+0.24**)   | 92.84 (-0.42)       |

Table 2: Accuracy (%) for 4-bit quantized ResNet-18 models on ImageNet. FP denotes full-precision accuracy, $\Delta$ FP denotes difference in performance compared to the corresponding FP network. **Best**/second best relative performances for each bit-width are marked in **bold**/underlined.

| Method       | FP   | W4/A4 ($\Delta$ FP) |
|--------------|------|---------------------|
| L1 Reg [39]  | 69.7 | 57.5 (-12.5)        |
| SinReQ [42]  | 70.5 | 64.6 (-5.9)         |
| LTS [41]     | 69.6 | 68.3 (-1.3)         |
| PACT [9]     | 69.7 | 69.2 (-0.5)         |
| LQ-Nets [13] | 70.3 | 69.3 (-1.0)         |
| QIL [43]     | 70.2 | 70.1 (-0.1)         |
| QSin [44]    | 69.8 | 69.7 (-0.1)         |
| LCQ [14]     | 70.4 | 71.5 (+1.1)         |
| Ours         | 69.6 | 69.6 (<u>-0.0</u>)  |

Table 3: Accuracy (%) for 4-bit quantized SNNs on CIFAR10-DVS and N-Caltech 101. FP denotes full-precision accuracy, $\Delta$ FP denotes difference in performance compared to the FP network.

| Dataset       | Model             | FP    | W4 ($\Delta$ FP) |
|---------------|-------------------|-------|------------------|
| CIFAR10-DVS   | Spiking VGG-11    | 71.92 | 71.84 (-0.08)    |
| CIFAR10-DVS   | Spiking ResNet-19 | 72.91 | 72.14 (-0.77)    |
| N-Caltech 101 | Spiking VGG-11    | 73.19 | 74.18 (+0.99)    |
| N-Caltech 101 | Spiking ResNet-19 | 75.27 | 75.93 (+0.66)    |

**Robustness to Faults.** SA faults represent extreme non-idealities in hardware, with each faulty bit halving the range of possible weight values. High device variability in conductance states can similarly cause significant discrepancies between expected and realized weights. Our approach, combining periodic loading of faulty weights during training with a fault-aware modified QAT algorithm, demonstrates robust performance even under high SA fault densities.
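The halving effect of each stuck bit can be illustrated with a small sketch. The fault model here is an assumption made for illustration: each bit fails independently with probability `density`, and a failed bit is stuck at 0 or 1 with equal chance. The function names and the quantizer values are likewise hypothetical.

```python
import random
from itertools import product

def sample_faults(n_bits, density, rng):
    """Stuck-at map for one weight: {bit index: stuck value}.

    Assumed fault model: independent failures, stuck value chosen uniformly.
    """
    return {k: rng.randint(0, 1) for k in range(n_bits) if rng.random() < density}

def reachable_levels(r, c, stuck):
    """Levels <r, b> + c whose bit pattern b agrees with every stuck bit."""
    return sorted(sum(rk * bk for rk, bk in zip(r, b)) + c
                  for b in product((0, 1), repeat=len(r))
                  if all(b[k] == v for k, v in stuck.items()))

r, c = [0.125, 0.25, 0.75], -0.5  # hypothetical 3-bit quantizer

print(len(reachable_levels(r, c, {})))            # 8
print(len(reachable_levels(r, c, {0: 1})))        # 4
print(len(reachable_levels(r, c, {0: 1, 2: 0})))  # 2

# Draw a fault map at 30% density for one weight (result depends on the seed).
rng = random.Random(0)
print(sample_faults(len(r), 0.3, rng))
```

A weight with k stuck bits can reach only $2^{N-k}$ of the $2^N$ bit patterns, which is why the learned multipliers and the remaining free bits must compensate.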

Figure 2: Performance preservation with SA faults: periodic faulty weight loading maintains accuracy for low fault densities; our fault-aware modified QAT extends robustness to high fault fractions.

**Hardware Compatibility.** Figures 3a and 3b illustrate implementations of custom bit multipliers in analog and digital crossbar arrays, respectively, compatible with our quantization method. For analog arrays, the implementation incurs no additional cost, requiring only adjustment of bit-multiplier conductance values from power-of-2 proportions to custom values. In digital arrays, the multiply-accumulate operation remains unchanged, but peripheral circuitry must be modified to convert right-shift operations to multiplications, introducing a modest overhead. Learning custom bit multipliers within QAT can enable highly effective low-bit quantization models, which are compatible with the standard in-memory computing architectures.
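The arithmetic behind this compatibility can be sketched in software: a dot product with weights of the form $\langle r, b \rangle + c$ splits into one binary accumulation per bit plane, followed by a scaling with the corresponding multiplier. This is only the underlying identity, not a model of the circuits in Figure 3, and all names and values below are hypothetical.

```python
def mac_reference(x, bits, r, c):
    """Dot product with dequantized weights w_j = <r, bits[j]> + c."""
    return sum(xj * (sum(rk * bk for rk, bk in zip(r, b)) + c)
               for xj, b in zip(x, bits))

def mac_bit_planes(x, bits, r, c):
    """Same dot product, accumulated one bit plane at a time.

    Each plane k needs only a sum of inputs gated by a stored bit.
    The per-plane result is then scaled by r[k] instead of by 2**k.
    """
    plane_sums = [sum(xj * b[k] for xj, b in zip(x, bits))
                  for k in range(len(r))]
    return sum(rk * pk for rk, pk in zip(r, plane_sums)) + c * sum(x)

x = [1.0, 2.0, -1.0]                       # hypothetical input activations
bits = [(0, 0, 1), (1, 1, 0), (1, 0, 1)]   # stored 3-bit codes, one per weight
r, c = [0.125, 0.25, 0.75], -0.5           # hypothetical learned multipliers and offset

print(mac_reference(x, bits, r, c))   # -0.375
print(mac_bit_planes(x, bits, r, c))  # -0.375
```

Setting `r = [1, 2, 4]` and `c = 0` reduces `mac_bit_planes` to the usual power-of-2 bit weighting, so the custom multipliers change only the per-plane scale factors, not the accumulation itself.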



Figure 3: Custom bit multipliers in analog/digital crossbar arrays, compatible with our quantizer.

# 5 Conclusion and Future Work

We introduce a novel algorithm for learning bit multipliers within QAT, enabling efficient low-bit quantization models with learnable, non-uniform levels compatible with in-memory computing architectures. Our approach demonstrates minimal accuracy drops for 3- and 4-bit models compared to FP baselines across various datasets and architectures, including CIFAR10 and ImageNet using ResNet-18, and CIFAR10-DVS and N-Caltech 101 using spiking VGG-11 and ResNet-19. Notably, our quantized models occasionally outperform their FP counterparts. We further extend our method to address SA faults, maintaining performance with up to 30% faulty bits. Future directions include extending the method to channel-specific quantizers, conducting fault-aware training experiments on additional benchmarks, expanding ANN and SNN model evaluations, and exploring sub-3-bit quantization. These advancements aim to enhance the efficiency and robustness of quantized neural networks for resource-constrained environments and hardware non-idealities.

## References

- [1] Navjot Kukreja, Alena Shilova, Olivier Beaumont, Jan Huckelheim, Nicola Ferrier, Paul Hovland, and Gerard Gorman. Training on the edge: The why and the how. In 2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pages 899–903. IEEE, 2019.
- [2] Yu-Hsin Chen, Tien-Ju Yang, Joel Emer, and Vivienne Sze. Eyeriss v2: A flexible accelerator for emerging deep neural networks on mobile devices. *IEEE Journal on Emerging and Selected Topics in Circuits and Systems*, 9(2):292–308, 2019.
- [3] Yu-Der Chih, Po-Hao Lee, Hidehiro Fujiwara, Yi-Chun Shih, Chia-Fu Lee, Rawan Naous,
   Yu-Lin Chen, Chieh-Pu Lo, Cheng-Han Lu, Haruki Mori, et al. 16.4 an 89tops/w and 16.3
   tops/mm 2 all-digital sram-based full-precision compute-in memory macro in 22nm for machine-learning edge applications. In 2021 IEEE International Solid-State Circuits Conference (ISSCC),
   volume 64, pages 252–254. IEEE, 2021.
- [4] Hongyang Jia, Hossein Valavi, Yinqi Tang, Jintao Zhang, and Naveen Verma. A programmable heterogeneous microprocessor based on bit-scalable in-memory computing. *IEEE Journal of Solid-State Circuits*, 55(9):2609–2621, 2020.
- [5] Jae-sun Seo, Jyotishman Saikia, Jian Meng, Wangxin He, Han-sok Suh, Yuan Liao, Ahmed
   Hasssan, Injune Yeo, et al. Digital versus analog artificial intelligence accelerators: Advances,
   trends, and emerging designs. *IEEE Solid-State Circuits Magazine*, 14(3):65–79, 2022.
- [6] Lixia Han, Renjie Pan, Zheng Zhou, Hairuo Lu, Yiyang Chen, Haozhang Yang, Peng Huang, Guangyu Sun, Xiaoyan Liu, and Jinfeng Kang. Comn: Algorithm-hardware co-design platform for non-volatile memory based convolutional neural network accelerators. *IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems*, 2024.
- [7] Hanbo Sun, Zhenhua Zhu, Chenyu Wang, Xuefei Ning, Guohao Dai, Huazhong Yang, and Yu Wang. Gibbon: An efficient co-exploration framework of nn model and processing-in-memory architecture. *IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems*, 2023.
- [8] Muhammad Abdullah Hanif and Muhammad Shafique. Faq: Mitigating the impact of faults in the weight memory of dnn accelerators through fault-aware quantization. *arXiv preprint arXiv:2305.12590*, 2023.
- [9] Jungwook Choi, Zhuo Wang, Swagath Venkataramani, Pierce I-Jen Chuang, Vijayalakshmi Srinivasan, and Kailash Gopalakrishnan. Pact: Parameterized clipping activation for quantized neural networks. *arXiv preprint arXiv:1805.06085*, 2018.
- [10] Steven K Esser, Jeffrey L McKinstry, Deepika Bablani, Rathinakumar Appuswamy, and Dharmendra S Modha. Learned step size quantization. *arXiv preprint arXiv:1902.08153*, 2019.
- [11] Chen Tang, Kai Ouyang, Zhi Wang, Yifei Zhu, Wen Ji, Yaowei Wang, and Wenwu Zhu. Mixed-precision neural network quantization via learned layer-wise importance. In *European Conference on Computer Vision*, pages 259–275. Springer, 2022.
- [12] Matthias Wess, Sai Manoj Pudukotai Dinakarrao, and Axel Jantsch. Weighted quantization-regularization in dnns for weight memory minimization toward hw implementation. *IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems*, 37(11):2929–2939, 2018.
- [13] Dongqing Zhang, Jiaolong Yang, Dongqiangzi Ye, and Gang Hua. Lq-nets: Learned quantization for highly accurate and compact deep neural networks. In *Proceedings of the European conference on computer vision (ECCV)*, pages 365–382, 2018.
- [14] Kohei Yamamoto. Learnable companding quantization for accurate low-bit neural networks. In *Proceedings of the IEEE/CVF conference on computer vision and pattern recognition*, pages 5029–5038, 2021.

- [15] Wolfgang Maass. Networks of spiking neurons: the third generation of neural network models. *Neural networks*, 10(9):1659–1671, 1997.
- [16] Kaushik Roy, Akhilesh Jaiswal, and Priyadarshini Panda. Towards spike-based machine intelligence with neuromorphic computing. *Nature*, 575(7784):607–617, 2019.
- [17] Filipp Akopyan, Jun Sawada, Andrew Cassidy, Rodrigo Alvarez-Icaza, John Arthur, Paul Merolla, Nabil Imam, Yutaka Nakamura, Pallab Datta, Gi-Joon Nam, et al. Truenorth: Design and tool flow of a 65 mw 1 million neuron programmable neurosynaptic chip. *IEEE transactions on computer-aided design of integrated circuits and systems*, 34(10):1537–1557, 2015.
- [18] Mike Davies, Narayan Srinivasa, Tsung-Han Lin, Gautham Chinya, Yongqiang Cao, Sri Harsha Choday, Georgios Dimou, Prasad Joshi, Nabil Imam, Shweta Jain, et al. Loihi: A neuromorphic manycore processor with on-chip learning. *IEEE Micro*, 38(1):82–99, 2018.
- [19] Xiaochen Peng, Shanshi Huang, Hongwu Jiang, Anni Lu, and Shimeng Yu. Dnn+ neurosim v2.0: An end-to-end benchmarking framework for compute-in-memory accelerators for on-chip training. *IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems*, 40(11):2306–2319, 2020.
- [20] Malte J Rasch, Diego Moreda, Tayfun Gokmen, Manuel Le Gallo, Fabio Carta, Cindy Goldberg, Kaoutar El Maghraoui, Abu Sebastian, and Vijay Narayanan. A flexible and fast pytorch toolkit for simulating training and inference on analog crossbar arrays. In *2021 IEEE 3rd international conference on artificial intelligence circuits and systems (AICAS)*, pages 1–4. IEEE, 2021.
- [21] Corey Lammie, Wei Xiang, Bernabé Linares-Barranco, and Mostafa Rahimi Azghadi. Memtorch: An open-source simulation framework for memristive deep learning systems. *Neurocomputing*, 485:124–133, 2022.
- [22] Tao Liu, Wujie Wen, Lei Jiang, Yanzhi Wang, Chengmo Yang, and Gang Quan. A fault-tolerant neural network architecture. In *Proceedings of the 56th Annual Design Automation Conference 2019*, pages 1–6, 2019.
- [23] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
- [24] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. *International journal of computer vision*, 115:211–252, 2015.
- [25] Hongmin Li, Hanchao Liu, Xiangyang Ji, Guoqi Li, and Luping Shi. Cifar10-dvs: an event-stream dataset for object classification. *Frontiers in neuroscience*, 11:309, 2017.
- [26] Garrick Orchard, Ajinkya Jayawant, Gregory K Cohen, and Nitish Thakor. Converting static image datasets to spiking neuromorphic datasets using saccades. *Frontiers in neuroscience*, 9:437, 2015.
- [27] Wulfram Gerstner and Werner M Kistler. *Spiking neuron models: Single neurons, populations, plasticity.* Cambridge university press, 2002.
- [28] Emre O Neftci, Hesham Mostafa, and Friedemann Zenke. Surrogate gradient learning in spiking neural networks: Bringing the power of gradient-based optimization to spiking neural networks. *IEEE Signal Processing Magazine*, 36(6):51–63, 2019.
- [29] Wei Fang, Zhaofei Yu, Yanqi Chen, Timothée Masquelier, Tiejun Huang, and Yonghong Tian.
   Incorporating learnable membrane time constant to enhance learning of spiking neural networks.
   In Proceedings of the IEEE/CVF international conference on computer vision, pages 2661–2671,
   2021.
- [30] Jeffrey L McKinstry, Steven K Esser, Rathinakumar Appuswamy, Deepika Bablani, John V
   Arthur, Izzet B Yildiz, and Dharmendra S Modha. Discovering low-precision networks close to full-precision networks for efficient embedded inference. arXiv preprint arXiv:1809.04191, 2018.

- [31] Wonyong Sung, Sungho Shin, and Kyuyeon Hwang. Resiliency of deep neural networks under quantization. *arXiv preprint arXiv:1511.06488*, 2015.
- [32] Asit Mishra and Debbie Marr. Apprentice: Using knowledge distillation techniques to improve low-precision network accuracy. *arXiv preprint arXiv:1711.05852*, 2017.
- [33] Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. *arXiv preprint arXiv:1608.03983*, 2016.
- [34] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In *Proceedings of the IEEE conference on computer vision and pattern recognition*, pages 770–778, 2016.
- [35] Hanle Zheng, Yujie Wu, Lei Deng, Yifan Hu, and Guoqi Li. Going deeper with directly-trained larger spiking neural networks. In *Proceedings of the AAAI conference on artificial intelligence*, volume 35, pages 11062–11070, 2021.
- [36] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. *arXiv preprint arXiv:1409.1556*, 2014.
- [37] Yuhang Li, Youngeun Kim, Hyoungseob Park, Tamar Geller, and Priyadarshini Panda. Neuromorphic data augmentation for training spiking neural networks. In *European Conference on Computer Vision*, pages 631–649. Springer, 2022.
- [38] Wei Fang, Yanqi Chen, Jianhao Ding, Zhaofei Yu, Timothée Masquelier, Ding Chen, Liwei
   Huang, Huihui Zhou, Guoqi Li, and Yonghong Tian. Spikingjelly: An open-source machine
   learning infrastructure platform for spike-based intelligence. *Science Advances*, 9(40):eadi1480,
   2023.
- [39] Milad Alizadeh, Arash Behboodi, Mart Van Baalen, Christos Louizos, Tijmen Blankevoort, and Max Welling. Gradient ℓ1 regularization for quantization robustness. *arXiv preprint arXiv:2002.07520*, 2020.
- [40] Han-Byul Kim, Eunhyeok Park, and Sungjoo Yoo. Basq: Branch-wise activation-clipping
   search quantization for sub-4-bit neural networks. In *European Conference on Computer Vision*,
   pages 17–33. Springer, 2022.
- [41] Yunshan Zhong, Gongrui Nan, Yuxin Zhang, Fei Chao, and Rongrong Ji. Exploiting the partly scratch-off lottery ticket for quantization-aware training. *arXiv preprint arXiv:2211.08544*, 2022.
- [42] Ahmed T Elthakeb, Prannoy Pilligundla, and Hadi Esmaeilzadeh. Sinreq: Generalized sinusoidal regularization for low-bitwidth deep quantized training. *arXiv preprint arXiv:1905.01416*, 2019.
- [43] Sangil Jung, Changyong Son, Seohyung Lee, Jinwoo Son, Jae-Joon Han, Youngjun Kwak, Sung Ju Hwang, and Changkyu Choi. Learning to quantize deep networks by optimizing quantization intervals with task loss. In *Proceedings of the IEEE/CVF conference on computer vision and pattern recognition*, pages 4350–4359, 2019.
- [44] Kirill Solodskikh, Vladimir Chikin, Ruslan Aydarkhanov, Dehua Song, Irina Zhelavskaya, and Jiansheng Wei. Towards accurate network quantization with equivalent smooth regularizer. In *European Conference on Computer Vision*, pages 727–742. Springer, 2022.