Scaling NVMs in Event-Driven Architectures for Edge Inference

Published: 01 Jan 2024, Last Modified: 15 May 2025 · APCCAS 2024 · CC BY-SA 4.0
Abstract: Edge devices running machine learning (ML) inference, such as in automotive systems, drones, and wearables, face constraints in power, form factor, and cost while still requiring high-performance computing. Growing ML model sizes and on-chip SRAM memory requirements present bottlenecks for system efficiency, especially in advanced technology nodes. Emerging memory technologies and event-based AI computing have the potential to overcome these challenges. High-speed MRAM can reduce the power and area overheads of on-chip SRAM for neural network weight storage, and event-driven dataflow can lower inference latency and power consumption. This study models and analyzes the power, performance, and area (PPA) scaling of SRAM and MRAM macros from a 22nm planar CMOS node (N22) and a 12nm FinFET CMOS node (N12) down to the imec 14 Å gate-all-around nanosheet CMOS node (A14). We evaluate MRAM versus SRAM for weight storage in an event-driven multicore inference architecture, simulating different workloads. Our results show that STT-MRAM reduces total energy by $\sim 25\%$ and leakage energy by $\sim 40\%$ per inference in the N22 and N12 nodes, though scaling to A14 VGSOT-MRAM increases energy consumption and latency. Despite this, MRAM weight storage reduces area by $\sim 40\%$ across all nodes. This work provides insights that can guide the future development of edge computing technologies based on emerging non-volatile memories.
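To make the SRAM-versus-MRAM trade-off described above concrete, the following is a minimal, first-order sketch of how per-inference energy splits into dynamic (read) and leakage components for a weight-storage macro. It is not the paper's simulation methodology; the class `WeightMemory`, the function `energy_per_inference`, and all numeric values are illustrative placeholders chosen only to show the structure of the comparison (MRAM trading somewhat higher read energy for much lower leakage and smaller area).

```python
# Hypothetical first-order per-inference energy model for a weight-storage
# macro. All names and numbers are illustrative assumptions, not values or
# code from the study.

from dataclasses import dataclass


@dataclass
class WeightMemory:
    """First-order PPA descriptors for a weight-storage macro (assumed)."""
    name: str
    read_energy_pj_per_bit: float  # dynamic read energy per bit [pJ/bit]
    leakage_power_uw: float        # standby leakage of the macro [uW]
    area_mm2: float                # macro area [mm^2]


def energy_per_inference(mem: WeightMemory,
                         bits_read: float,
                         inference_time_us: float) -> dict:
    """Split per-inference energy into dynamic and leakage components."""
    # pJ/bit * bits -> pJ; scale by 1e-6 to report uJ
    dynamic_uj = mem.read_energy_pj_per_bit * bits_read * 1e-6
    # uW * us -> pJ; scale by 1e-6 to report uJ
    leakage_uj = mem.leakage_power_uw * inference_time_us * 1e-6
    return {"dynamic_uj": dynamic_uj,
            "leakage_uj": leakage_uj,
            "total_uj": dynamic_uj + leakage_uj}


if __name__ == "__main__":
    # Placeholder macros: MRAM assumed to have higher read energy but far
    # lower leakage and smaller area than SRAM at the same node.
    sram = WeightMemory("SRAM (N12)", read_energy_pj_per_bit=0.10,
                        leakage_power_uw=500.0, area_mm2=1.0)
    mram = WeightMemory("STT-MRAM (N12)", read_energy_pj_per_bit=0.15,
                        leakage_power_uw=50.0, area_mm2=0.6)

    for mem in (sram, mram):
        e = energy_per_inference(mem, bits_read=8e6, inference_time_us=1000.0)
        print(f"{mem.name}: total {e['total_uj']:.2f} uJ "
              f"(dynamic {e['dynamic_uj']:.2f}, leakage {e['leakage_uj']:.2f}), "
              f"area {mem.area_mm2:.2f} mm^2")
```

Under such a model, workloads with long idle periods (as in event-driven dataflow) shift the balance toward leakage, which is where non-volatile MRAM gains most of its advantage over SRAM.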