Spatial-temporal Data Compression of Dynamic Vision Sensor Output with High Pixel-level Saliency using Low-precision Sparse Autoencoder
Abstract: Imaging innovations such as the dynamic vision sensor (DVS) can significantly reduce image data volume by recording only changes in the scene as events. However, when the DVS camera itself moves (e.g., on self-driving cars), the DVS output stream is no longer sparse enough to achieve the desired hardware efficiency. In this work, we design a compact sparse autoencoder model to aggressively compress event-based DVS output. The proposed encoder-decoder autoencoder is a shallow convolutional neural network (CNN) architecture with two convolution layers and two inverse-convolution layers, totaling only ~10k parameters. We apply quantization-aware training to our proposed model to achieve 2-bit and 4-bit precision. Moreover, we apply unstructured pruning to the encoder module to achieve >90% active-pixel compression in the latent space. The proposed autoencoder design has been validated against multiple benchmark DVS-based datasets, including DVS-MNIST, N-Cars, DVS-IBM Gesture, and the Prophesee Automotive Gen1 dataset. We achieve small accuracy drops of 2%, 3%, and 3.8% relative to the uncompressed baseline, with 7.08%, 1.36%, and 5.53% active pixels in the decoder output images (compression ratios of 13.1×, 29.1×, and 18.1×) for the DVS-MNIST, N-Cars, and DVS-IBM Gesture datasets, respectively. For the Prophesee Automotive Gen1 dataset, we achieve a minimal mAP drop of 0.07 from the baseline with 9% active pixels in the decoder output images (compression ratio of 11.9×).
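The abstract describes a shallow two-layer convolution / two-layer inverse-convolution autoencoder with unstructured pruning on the encoder. The following is a minimal PyTorch sketch of such an architecture; the channel widths, kernel sizes, strides, and 90% pruning ratio are illustrative assumptions and do not reflect the paper's actual configuration or parameter count.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune


class ShallowDVSAutoencoder(nn.Module):
    """Hypothetical shallow autoencoder for DVS event frames:
    two strided convolutions compress the input, two transposed
    convolutions reconstruct it (channel widths are guesses)."""

    def __init__(self, in_channels=2, hidden=8, latent=4):
        super().__init__()
        # Encoder: two convolutions, each halving the spatial resolution
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, hidden, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(hidden, latent, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
        )
        # Decoder: two transposed convolutions restoring the input resolution
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(latent, hidden, kernel_size=3, stride=2,
                               padding=1, output_padding=1),
            nn.ReLU(),
            nn.ConvTranspose2d(hidden, in_channels, kernel_size=3, stride=2,
                               padding=1, output_padding=1),
        )

    def forward(self, x):
        z = self.encoder(x)            # sparse latent representation
        return self.decoder(z), z


model = ShallowDVSAutoencoder()
print(sum(p.numel() for p in model.parameters()))  # parameter count of this sketch

# Unstructured L1-magnitude pruning on the encoder weights, standing in for
# the paper's encoder pruning step (90% is an assumed ratio).
for module in model.encoder.modules():
    if isinstance(module, nn.Conv2d):
        prune.l1_unstructured(module, name="weight", amount=0.9)
```

In practice, quantization-aware training (e.g., via fake-quantization wrappers) would be layered on top of such a model to reach the 2-bit and 4-bit precisions reported in the abstract; the details of that step are not specified here.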