Abstract: Serving transformer language models efficiently is constrained by the significant memory footprint of the Key-Value (KV) cache. While recent optimizations focus on compressing the cache along the temporal axis, we argue that the depth dimension offers a robust, orthogonal avenue for improving cache efficiency. Existing cross-layer sharing methods often incur throughput or latency overhead. In this work, we introduce Random Cross-Layer Attention (R-CLA), a training scheme in which layers stochastically attend to either their own KV states or those of a preceding layer. This simple approach decouples layers from layer-specific KV features, enabling flexible depth-wise cache sharing at inference time. We demonstrate that R-CLA enables significant memory savings and acts as a regularizer that improves generalization in larger models.
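To make the training scheme concrete, here is a minimal sketch of what a stochastic cross-layer KV choice could look like in PyTorch. The abstract does not specify the implementation, so the class name `RCLAAttention`, the sharing probability `p_share`, and the per-layer coin flip are illustrative assumptions, not the authors' actual method or code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RCLAAttention(nn.Module):
    """Hypothetical sketch of one R-CLA attention layer.

    During training, with probability `p_share` (assumed hyperparameter),
    the layer attends over the KV states handed down from a preceding
    layer instead of computing its own.
    """

    def __init__(self, d_model: int, n_heads: int, p_share: float = 0.5):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.p_share = p_share
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.o_proj = nn.Linear(d_model, d_model)

    def _split_heads(self, x: torch.Tensor) -> torch.Tensor:
        # (batch, seq, d_model) -> (batch, heads, seq, d_head)
        b, t, _ = x.shape
        return x.view(b, t, self.n_heads, self.d_head).transpose(1, 2)

    def forward(self, x: torch.Tensor, prev_kv=None):
        q = self._split_heads(self.q_proj(x))
        # Stochastic choice: reuse the preceding layer's KV states,
        # or compute fresh ones from this layer's input.
        if self.training and prev_kv is not None and torch.rand(()) < self.p_share:
            k, v = prev_kv
        else:
            k = self._split_heads(self.k_proj(x))
            v = self._split_heads(self.v_proj(x))
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        b, _, t, _ = out.shape
        out = out.transpose(1, 2).reshape(b, t, -1)
        # Pass this layer's (possibly reused) KV states to the next layer.
        return self.o_proj(out), (k, v)


# Usage: a layer may stochastically reuse the previous layer's KV cache.
layer = RCLAAttention(d_model=64, n_heads=4).train()
x = torch.randn(2, 10, 64)
y1, kv = layer(x)              # computes its own KV states
y2, _ = layer(x, prev_kv=kv)   # may attend over the handed-down KV instead
```

Under this kind of scheme, no layer can rely on attending to a fixed layer's KV states, which is what would permit dropping some layers' caches entirely at inference time.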
Submission Number: 14