S3TC: Spiking Separated Spatial and Temporal Convolutions with Unsupervised STDP-Based Learning for Action Recognition
Abstract: Video analysis is a major computer vision task that has received considerable attention in recent years. The current state-of-the-art performance in video analysis is achieved with Deep Neural Networks (DNNs), which have a high energy cost and require large amounts of labeled data for training. Spiking Neural Networks (SNNs) can have an energy cost thousands of times lower than regular non-spiking networks when implemented on neuromorphic hardware [39, 40]. They have been used for video analysis with methods like 3D Convolutional Spiking Neural Networks (CSNNs). However, these networks have a significantly larger number of parameters than 2D CSNNs. This not only increases their computational cost, but can also make them more difficult to implement on ultra-low-power neuromorphic hardware. In this work, we use CSNNs trained in an unsupervised manner with the Spike Timing-Dependent Plasticity (STDP) rule, and we introduce, for the first time, Spiking Separated Spatial and Temporal Convolutions (S3TCs). Using unsupervised STDP for feature learning reduces the amount of labeled data required for training. Factorizing a single spatio-temporal spiking convolution into a spatial and a temporal spiking convolution decreases the number of parameters of the network. We evaluate our network on the KTH, Weizmann, and IXMAS datasets. Our results show that S3TCs successfully extract spatio-temporal information from videos and outperform spiking 3D convolutions, while preserving the output spiking activity, which typically decreases in deeper spiking networks.
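To make the parameter reduction from the factorization concrete, the following is a minimal, non-spiking sketch (not the paper's STDP-trained implementation): it compares a full 3D convolution with a spatial (1 x k x k) convolution followed by a temporal (k x 1 x 1) convolution, using hypothetical channel and kernel sizes chosen only for illustration.

```python
import torch
import torch.nn as nn

# Hypothetical sizes, chosen only for illustration.
c_in, c_out, k = 16, 32, 3

# Full 3D (spatio-temporal) convolution: one k x k x k kernel per filter.
conv3d = nn.Conv3d(c_in, c_out, kernel_size=(k, k, k), padding=1, bias=False)

# Separated convolutions: a spatial (1 x k x k) convolution followed by
# a temporal (k x 1 x 1) convolution.
spatial = nn.Conv3d(c_in, c_out, kernel_size=(1, k, k), padding=(0, 1, 1), bias=False)
temporal = nn.Conv3d(c_out, c_out, kernel_size=(k, 1, 1), padding=(1, 0, 0), bias=False)

def n_params(*modules):
    return sum(p.numel() for m in modules for p in m.parameters())

print("3D convolution parameters:     ", n_params(conv3d))            # 16*32*27 = 13824
print("Separated convolutions (total):", n_params(spatial, temporal)) # 16*32*9 + 32*32*3 = 7680

# Both paths map the same input shape (batch, channels, time, height, width)
# to the same output shape.
x = torch.randn(1, c_in, 8, 32, 32)
assert conv3d(x).shape == temporal(spatial(x)).shape
```

With these example sizes, the separated pair uses roughly half the parameters of the single 3D convolution; the exact saving depends on the chosen channel counts and kernel size.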