Keywords: Automatic Differentiation, Parallel Computing, Memory-Efficient Neural Networks, Kernel Design, Deep Learning
TL;DR: This paper presents a unifying algebraic framework and a key theorem for (a class of) memory-efficient neural network layers.
Abstract: Recent advances in memory-efficient neural network layers, such as FlashAttention, often appear as specialized engineering solutions but share a common mathematical structure. We show that many of these kernels can be understood as folds over commutative monoids, a perspective that unifies MapReduce-style computation with modern deep learning optimizations. Building on this, we introduce the Local Gradient Theorem, which provides a sufficient condition under which gradients of monoidal folds can be computed locally from the final output and individual inputs, enabling efficient backward passes. We demonstrate that attention, cross-entropy, and two-layer MLPs all admit such monoid structures, recovering known memory-efficient kernels and extending the framework to new settings. This algebraic perspective offers a principled foundation for systematically designing memory- and cache-efficient layers, rather than discovering them in an ad hoc manner.
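To make the "fold over a commutative monoid" idea concrete, here is a minimal illustrative sketch of the (running max, rescaled sum) monoid that underlies online softmax and streaming log-sum-exp, a building block of kernels like FlashAttention. All names (`combine`, `lift`, `IDENTITY`, `logsumexp`) are our own choices for this sketch, not identifiers from the paper or its supplementary code.

```python
import math
from functools import reduce

# Identity element of the monoid: max = -inf, sum = 0.
IDENTITY = (float("-inf"), 0.0)

def combine(a, b):
    """Associative and commutative combine on (running max, rescaled sum)."""
    (m1, s1), (m2, s2) = a, b
    m = max(m1, m2)
    # Rescale both partial sums to the shared maximum before adding,
    # so each partial result stays numerically stable.
    return (m, s1 * math.exp(m1 - m) + s2 * math.exp(m2 - m))

def lift(x):
    """Embed a single score into the monoid."""
    return (x, 1.0)

def logsumexp(xs):
    """Streaming log-sum-exp as a monoidal fold over the inputs."""
    m, s = reduce(combine, map(lift, xs), IDENTITY)
    return m + math.log(s)
```

Because `combine` is associative and commutative, the fold can be split across blocks (or devices) in any order and the partial `(max, sum)` pairs merged afterward, which is exactly the MapReduce-style structure the abstract refers to.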
Supplementary Material: zip
Primary Area: infrastructure, software libraries, hardware, systems, etc.
Submission Number: 9115