Abstract: Monocular depth estimation is a crucial task in many embedded vision systems, with numerous applications in autonomous driving, robotics, and augmented reality. Traditional methods often rely only on frame-based approaches, which struggle in dynamic scenes due to motion blur and limited temporal resolution, while event-based cameras offer complementary high temporal resolution but lack dense spatial resolution and context. We propose a novel embedded multimodal monocular depth estimation framework using a hybrid spiking neural network (SNN) and artificial neural network (ANN) architecture. The framework leverages a custom accelerator, TransPIM, for efficient transformer deployment, enabling real-time depth estimation on embedded systems. Our approach combines the advantages of frame-based and event-based cameras: the SNN extracts low-level features and generates sparse representations from events, which are then fed into an ANN together with the frame-based input to estimate depth. The hybrid SNN-ANN architecture allows for efficient processing of both RGB and event data, achieving competitive performance across standard depth estimation accuracy metrics on the MVSEC and DENSE benchmark datasets. To make the framework accessible to embedded systems, we deploy it on TransPIM, achieving a 9× speedup and 183× lower energy consumption compared to standard GPUs, opening up new possibilities for a variety of embedded system applications.
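For intuition, a minimal sketch of how such a hybrid SNN-ANN pipeline might be wired up in PyTorch is shown below, assuming event voxel-grid inputs and a simple leaky integrate-and-fire (LIF) spiking layer. The module names (`LIFLayer`, `HybridDepthNet`), tensor shapes, and neuron dynamics are illustrative assumptions, not the authors' implementation, and the surrogate gradients needed to train the SNN portion are omitted.

```python
# Hypothetical sketch of an SNN-ANN fusion network for event + RGB depth estimation.
# All module names, shapes, and dynamics are illustrative assumptions.
import torch
import torch.nn as nn

class LIFLayer(nn.Module):
    """Leaky integrate-and-fire layer applied per time bin of an event voxel grid."""
    def __init__(self, in_ch, out_ch, tau=0.8, threshold=1.0):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.tau, self.threshold = tau, threshold

    def forward(self, x_seq):                        # x_seq: (T, B, C, H, W)
        mem, spikes = 0.0, []
        for x_t in x_seq:
            mem = self.tau * mem + self.conv(x_t)    # leaky membrane update
            s = (mem >= self.threshold).float()      # binary spikes = sparse features
            mem = mem - s * self.threshold           # soft reset after firing
            spikes.append(s)
        return torch.stack(spikes)                   # (T, B, out_ch, H, W)

class HybridDepthNet(nn.Module):
    """SNN encodes event bins into sparse features; an ANN fuses them with RGB for depth."""
    def __init__(self, event_ch=2, rgb_ch=3, feat=32):
        super().__init__()
        self.snn = LIFLayer(event_ch, feat)
        self.rgb_encoder = nn.Sequential(
            nn.Conv2d(rgb_ch, feat, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat, feat, 3, padding=1), nn.ReLU(),
        )
        self.depth_head = nn.Sequential(
            nn.Conv2d(2 * feat, feat, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat, 1, 1),                   # one depth value per pixel
        )

    def forward(self, event_voxels, rgb):            # event_voxels: (T, B, 2, H, W)
        spike_feat = self.snn(event_voxels).mean(0)  # accumulate spikes over time bins
        rgb_feat = self.rgb_encoder(rgb)             # rgb: (B, 3, H, W)
        fused = torch.cat([spike_feat, rgb_feat], dim=1)
        return self.depth_head(fused)                # (B, 1, H, W) depth map

# Example usage with random inputs (T=5 time bins, 64x64 resolution).
model = HybridDepthNet()
events = torch.rand(5, 1, 2, 64, 64)
rgb = torch.rand(1, 3, 64, 64)
print(model(events, rgb).shape)                      # torch.Size([1, 1, 64, 64])
```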