Abstract: Serving transformer language models efficiently is constrained by the significant memory footprint of the Key-Value (KV) cache. While recent optimizations focus on compressing the cache along the temporal axis, we argue that the depth dimension offers a robust, orthogonal avenue for improving cache efficiency. Existing cross-layer sharing methods often incur throughput or latency overhead. In this work, we introduce Random Cross-Layer Attention (R-CLA), a training scheme in which layers stochastically attend to either their own KV states or those of a preceding layer. This simple approach decouples layers from layer-specific KV features, enabling flexible depth-wise cache sharing at inference time. We demonstrate that R-CLA enables significant memory savings and acts as a regularizer that improves generalization in larger models.
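To make the training scheme concrete, here is a minimal sketch of what a stochastic cross-layer KV choice could look like in PyTorch. The abstract does not specify the implementation, so the class name `RCLAAttention`, the sharing probability `p_share`, and the per-layer coin flip are illustrative assumptions, not the authors' actual method or code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RCLAAttention(nn.Module):
    """Hypothetical sketch of one R-CLA attention layer.

    During training, with probability `p_share` (assumed hyperparameter),
    the layer attends over the KV states handed down from a preceding
    layer instead of computing its own.
    """

    def __init__(self, d_model: int, n_heads: int, p_share: float = 0.5):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.p_share = p_share
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.o_proj = nn.Linear(d_model, d_model)

    def _split_heads(self, x: torch.Tensor) -> torch.Tensor:
        # (batch, seq, d_model) -> (batch, heads, seq, d_head)
        b, t, _ = x.shape
        return x.view(b, t, self.n_heads, self.d_head).transpose(1, 2)

    def forward(self, x: torch.Tensor, prev_kv=None):
        q = self._split_heads(self.q_proj(x))
        # Stochastic choice: reuse the preceding layer's KV states,
        # or compute fresh ones from this layer's input.
        if self.training and prev_kv is not None and torch.rand(()) < self.p_share:
            k, v = prev_kv
        else:
            k = self._split_heads(self.k_proj(x))
            v = self._split_heads(self.v_proj(x))
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        b, _, t, _ = out.shape
        out = out.transpose(1, 2).reshape(b, t, -1)
        # Pass this layer's (possibly reused) KV states to the next layer.
        return self.o_proj(out), (k, v)


# Usage: a layer may stochastically reuse the previous layer's KV cache.
layer = RCLAAttention(d_model=64, n_heads=4).train()
x = torch.randn(2, 10, 64)
y1, kv = layer(x)              # computes its own KV states
y2, _ = layer(x, prev_kv=kv)   # may attend over the handed-down KV instead
```

Under this kind of scheme, no layer can rely on attending to a fixed layer's KV states, which is what would permit dropping some layers' caches entirely at inference time.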
Submission Number: 14