Brain network science modelling of sparse neural networks enables Transformers and LLMs to perform as fully connected
Keywords: Dynamic sparse training, Network science, Epitopological learning
Abstract: Cannistraci-Hebb training (CHT) is a brain-inspired method used in Dynamic Sparse Training (DST) to grow synaptic connectivity in sparse neural networks. CHT leverages a gradient-free, topology-driven link regrowth mechanism, which has been shown to achieve an ultra-sparse (1\% connectivity or lower) advantage over fully connected networks (FCs) across various tasks. Yet CHT suffers from two main drawbacks: (i) its time complexity is $\mathcal{O}(N \cdot d^3)$, where $N$ is the number of nodes and $d$ the node degree, so it can be applied efficiently only to ultra-sparse networks; (ii) it rigidly selects the top link-prediction scores, which is inappropriate in the early training epochs, when the network topology contains many unreliable connections. Here, we design the first brain-inspired network model, termed bipartite receptive field (BRF), to initialize the connectivity of sparse artificial neural networks. We then propose a GPU-friendly, matrix-multiplication-based approximation of the CH link predictor, which reduces the computational complexity to $\mathcal{O}(N^3)$ and enables fast link prediction in large-scale models. Moreover, we introduce the Cannistraci-Hebb training soft rule (CHTs), which adopts a flexible strategy for sampling connections in both link removal and regrowth, balancing exploration and exploitation of the network topology. We also propose a sigmoid-based gradual density decay strategy, leading to an advanced framework referred to as CHTss. Empirical results show that, using 1\% of the connections, CHTs outperforms FCs in MLP architectures on visual classification tasks and compresses some networks to less than 30\% of their nodes. Using only 5\% of the connections, CHTss outperforms FCs in two Transformer-based machine translation tasks. Finally, CHTs and CHTss achieve superior performance compared to other dynamic sparse training methods in language modeling across different sparsity levels on LLaMA 60M, 130M, and 1B, and, using 30\% of the connections, CHTs outperforms FC on the LLaMA 1B model.
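To make the three algorithmic ingredients mentioned in the abstract more concrete (a matrix-multiplication approximation of the CH link predictor, a soft sampling rule for link regrowth, and a sigmoid density-decay schedule), the following is a minimal PyTorch sketch. It is illustrative only: the path-count score, the softmax sampling, the schedule shape, and the function names (`bipartite_path_scores`, `soft_regrow`, `sigmoid_density`) are assumptions for exposition and do not reproduce the exact CH predictor or the CHTss schedule of the paper.

```python
import torch


def bipartite_path_scores(mask: torch.Tensor) -> torch.Tensor:
    """Score absent links of a sparse bipartite layer with two dense matmuls.

    `mask` is the 0/1 connectivity matrix of shape (n_in, n_out).
    The score of a candidate link (i, j) is the number of length-3 paths
    between input i and output j through existing links: a simplified,
    GPU-friendly stand-in for the CH link predictor (not the exact formula).
    """
    scores = mask @ mask.t() @ mask                         # (n_in, n_out) path counts
    return scores.masked_fill(mask.bool(), float("-inf"))   # ignore existing links


def soft_regrow(scores: torch.Tensor, k: int, temperature: float = 1.0) -> torch.Tensor:
    """Sample k new links in proportion to their scores (soft-rule sketch).

    Instead of rigidly taking the top-k scores, links are drawn from a softmax
    distribution, so low-score links keep a small probability of being explored
    while high-score links are still favoured (exploitation).
    """
    flat = scores.flatten()
    probs = torch.softmax(flat / temperature, dim=0)        # existing links get prob 0
    chosen = torch.multinomial(probs, k, replacement=False)
    grown = torch.zeros_like(flat)
    grown[chosen] = 1.0
    return grown.view_as(scores)


def sigmoid_density(step: int, total_steps: int,
                    d_start: float = 1.0, d_end: float = 0.05,
                    sharpness: float = 10.0) -> float:
    """Sigmoid-shaped schedule that gradually decays density from d_start to d_end."""
    t = step / total_steps                                  # training progress in [0, 1]
    s = torch.sigmoid(torch.tensor(sharpness * (t - 0.5))).item()
    return d_start + (d_end - d_start) * s


# Example: regrow 16 links of a 64x32 sparse layer that is roughly 10% dense.
mask = (torch.rand(64, 32) < 0.10).float()
new_links = soft_regrow(bipartite_path_scores(mask), k=16, temperature=2.0)
mask = torch.clamp(mask + new_links, max=1.0)
```

In a complete CHTs/CHTss loop one would also remove a matching number of low-importance links at each topology-update step and set the target density with a schedule like `sigmoid_density`; those details are omitted from this sketch.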
Submission Number: 103