Gated KalmaNet: A Fading Memory Layer Through Test-Time Ridge Regression

Published: 18 Dec 2025 · Last Modified: 06 Apr 2026 · CVPR 2026 · License: arXiv.org perpetual, non-exclusive license
Abstract: As efficient alternatives to softmax Attention, linear State-Space Models (SSMs) achieve constant memory and linear compute, but maintain only a lossy, fading summary of the past, often leading to inferior performance on recall-oriented tasks. We propose Gated KalmaNet (GKA), a layer that accounts for the full past while maintaining SSM-style efficiency. We ground our approach in the Kalman Filter (KF) framework, which provides a principled solution for optimal inference in dynamical systems. We show that several existing SSM layers (DeltaNet, Gated DeltaNet, and Kimi Delta Attention) are approximations to the KF recurrence that assume identity error covariance, thereby ignoring how past measurements (keys and values) should optimally influence state updates. In contrast, GKA computes the exact Kalman gain by maintaining the full error covariance. Under a steady-state assumption that enables parallelization, this reduces to solving an online ridge regression problem with constant memory and linear compute cost. A critical insight is that the standard KF equations are numerically unstable in low-precision environments (such as bfloat16) and hard to parallelize on modern hardware. We address this through: (1) adaptive regularization with input-dependent gating, which controls the condition number of the ridge regression for numerical stability, and (2) Chebyshev iteration, which we show is more stable than conventional iterative solvers in low-precision settings. We further develop hardware-aware chunk-wise kernels to enable efficient training. Empirically, GKA outperforms existing SSM layers (such as Mamba2 and Gated DeltaNet) on short-context tasks and achieves more than 10% relative improvement on long-context RAG and LongQA tasks with up to 128k tokens.
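To make the abstract's central reduction concrete, below is a minimal sketch (not the paper's chunk-wise kernel): the steady-state Kalman update amounts to ridge regression over past keys K and values V, solved by Chebyshev iteration. The function name chebyshev_ridge, the ridge strength lam, the spectral bounds lmin/lmax, and the use of float32 NumPy in place of bfloat16 hardware kernels are all illustrative assumptions, not the authors' implementation.

import numpy as np

def chebyshev_ridge(K, V, lam, lmin, lmax, iters=20):
    """Solve (K.T @ K + lam * I) @ W = K.T @ V by Chebyshev iteration.

    Sketch of the reduction in the abstract: the steady-state Kalman
    update becomes online ridge regression over past keys K and values
    V. Chebyshev iteration needs no residual inner products, only
    spectral bounds [lmin, lmax] on the regularized Gram matrix, which
    is what makes it attractive for parallel, low-precision execution.
    """
    d = K.shape[1]
    A = K.T @ K + lam * np.eye(d, dtype=K.dtype)  # regularized Gram matrix
    B = K.T @ V                                   # right-hand side
    center = (lmax + lmin) / 2.0                  # midpoint of the spectrum
    radius = (lmax - lmin) / 2.0                  # half-width of the spectrum
    W = np.zeros_like(B)
    R = B.copy()                                  # residual B - A @ W at W = 0
    for i in range(iters):
        if i == 0:
            P = R
            alpha = 1.0 / center
        else:
            beta = (0.5 * (radius * alpha) ** 2 if i == 1
                    else (radius * alpha / 2.0) ** 2)
            alpha = 1.0 / (center - beta / alpha)
            P = R + beta * P
        W = W + alpha * P
        R = R - alpha * (A @ P)
    return W

# Toy check against a direct solve. Exact eigenvalues stand in for the
# bounds that the layer's adaptive, gated regularization would provide.
rng = np.random.default_rng(0)
K = rng.standard_normal((256, 32)).astype(np.float32)
V = K @ rng.standard_normal((32, 16)).astype(np.float32)
lam = 1e-2
A = K.T @ K + lam * np.eye(32, dtype=np.float32)
evals = np.linalg.eigvalsh(A)
W = chebyshev_ridge(K, V, lam, evals[0], evals[-1])
print(np.max(np.abs(W - np.linalg.solve(A, K.T @ V))))  # small float32 round-off

Note that the Chebyshev step sizes alpha and beta depend only on the assumed spectral bounds, never on inner products of the residual, so the same scalar coefficients apply to every column and chunk. This is the property that eases parallelization, and it is consistent with the abstract's claim that keeping the condition number under control via gated regularization makes the solver behave well in low precision.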