Keywords: machine learning architectures, sparsity, residual connections, redundancy, long connections, pruning, synaptic pruning
Abstract: During brain development, an excess of synapses is initially created and then progressively eliminated through a process known as synaptic pruning. This process is activity-dependent, shaped by the brain's experiences. While creating an overabundance of synaptic connections only to later remove many might appear inefficient, research suggests that the resulting pruned networks are remarkably efficient and robust. Inspired by this biological process, we propose a neural network architecture that uses long connections in place of traditional short residual connections. When long-connection neural networks (LCNs) are trained with gradient descent, information is naturally "pushed" down to the first few layers, yielding a sparse network. More surprisingly, this simple architectural modification produces networks that exhibit behaviors similar to those of biological brain networks: early overconnectivity followed by later sparsity, enhanced robustness to noise, efficiency in low-data settings, and longer training times. Specifically, starting from a traditional neural network architecture with initial depth $d$ and $k$ connections, long connections are added from all layers to the last layer and summed. During LCN training, 30-80% of the top layers become effective identity mappings as all relevant information concentrates in the bottom layers. Pruning these top layers yields a refined network with reduced depth $d'$ and $k'$ final connections, achieving significant efficiency gains without any loss in performance compared to residual baselines. We apply this architecture to various classification tasks and show that, in all experiments, the network converges to using only a subset of the initially defined pre-training connections, with the amount of compression depending on task complexity.
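To make the architectural idea concrete, below is a minimal sketch of a long-connection network as described in the abstract: every layer's output is routed directly to the final layer and summed, in place of the usual short residual connections. This is an illustrative assumption of the design, not the authors' implementation; all names (`LongConnectionNet`, `hidden_dim`, the choice of fully connected blocks) are hypothetical.

```python
# Hypothetical sketch of a long-connection network (LCN): each block's output
# is accumulated and the sum feeds the final layer, instead of short residuals.
import torch
import torch.nn as nn


class LongConnectionNet(nn.Module):
    def __init__(self, in_dim: int, hidden_dim: int, num_classes: int, depth: int):
        super().__init__()
        self.input_proj = nn.Linear(in_dim, hidden_dim)
        # A stack of simple fully connected blocks; any block type could be used here.
        self.blocks = nn.ModuleList(
            [nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.ReLU())
             for _ in range(depth)]
        )
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.input_proj(x)
        # Long connections: every block's output is summed and passed to the
        # head, rather than being combined via layer-to-layer residual additions.
        accumulated = torch.zeros_like(h)
        for block in self.blocks:
            h = block(h)
            accumulated = accumulated + h
        return self.head(accumulated)


# Usage example (shapes are arbitrary):
# model = LongConnectionNet(in_dim=784, hidden_dim=256, num_classes=10, depth=8)
# logits = model(torch.randn(32, 784))
```

Under this reading, if a top block learns an identity-like mapping its contribution to the sum adds little new information, so pruning those blocks and their connections would leave the head's input essentially unchanged, consistent with the compression behavior the abstract reports.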
Primary Area: other topics in machine learning (i.e., none of the above)
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 12260