Keywords: Depth Up-Scaling, Memory Layer, Model Up-scaling
Abstract: Depth Up-Scaling (DUS) expands pre-trained language models by duplicating Transformer blocks, but in doing so it also duplicates FFN-heavy components, increasing parameter and compute costs. Studies on conditional computation and memory layers motivate lighter alternatives to dense FFN branches, while attention-head specialization suggests that such added capacity can be allocated more effectively at the finer granularity of individual heads. We propose Memory-Infused Depth Up-Scaling (MIDUS), which replaces the duplicated FFN branches with memory layers that provide lightweight, retrieval-based residual capacity. We instantiate the inserted memory layer as a Head-wise Memory Layer (HML), in which each attention-head output queries a distinct product-key space, and Head-wise Implicit Value Expansion (HIVE) realizes the retrieved latent values through head-specific projections from a shared latent bank. Experiments show improved performance and efficiency, while a fixed-retrieval structural analysis characterizes head-specific value realization as a structurally distinct alternative to FFN-based residual expansion.
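The head-wise retrieval described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, tensor shapes, softmax weighting of retrieved entries, and summation over heads are all assumptions made for illustration. Only two ideas are taken from the abstract: each head output queries its own product-key space, and the retrieved latent values come from a shared bank and pass through a head-specific projection.

```python
import numpy as np

def product_key_topk(query, sub_keys_a, sub_keys_b, k):
    """Generic product-key lookup: score each query half against its own
    sub-key table, then pick the best k pairs among the k x k candidates."""
    half = query.shape[0] // 2
    scores_a = sub_keys_a @ query[:half]          # (n,)
    scores_b = sub_keys_b @ query[half:]          # (n,)
    top_a = np.argsort(-scores_a)[:k]
    top_b = np.argsort(-scores_b)[:k]
    cand = scores_a[top_a][:, None] + scores_b[top_b][None, :]  # (k, k)
    flat = np.argsort(-cand.ravel())[:k]
    ia, ib = np.unravel_index(flat, cand.shape)
    n = sub_keys_b.shape[0]
    indices = top_a[ia] * n + top_b[ib]           # index into the n*n bank
    return indices, cand.ravel()[flat]

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def head_wise_memory(head_outputs, keys_a, keys_b, latent_bank, head_proj, k=4):
    """Hypothetical head-wise memory read.
    head_outputs: (H, d_head)         one query per attention head
    keys_a/b:     (H, n, d_head // 2) a distinct product-key space per head
    latent_bank:  (n * n, d_latent)   shared across heads (assumed layout)
    head_proj:    (H, d_latent, d_model) head-specific value projections
    """
    out = np.zeros(head_proj.shape[-1])
    for h in range(head_outputs.shape[0]):
        idx, scores = product_key_topk(head_outputs[h], keys_a[h], keys_b[h], k)
        latent = softmax(scores) @ latent_bank[idx]   # (d_latent,)
        out += latent @ head_proj[h]                  # assumed: sum over heads
    return out

# Toy usage with arbitrary sizes (all values are illustrative).
rng = np.random.default_rng(0)
H, d_head, n, d_latent, d_model = 4, 8, 16, 6, 32
head_outputs = rng.normal(size=(H, d_head))
keys_a = rng.normal(size=(H, n, d_head // 2))
keys_b = rng.normal(size=(H, n, d_head // 2))
latent_bank = rng.normal(size=(n * n, d_latent))
head_proj = rng.normal(size=(H, d_latent, d_model))
memory_out = head_wise_memory(head_outputs, keys_a, keys_b, latent_bank, head_proj)
```

In this sketch the bank stores only `d_latent`-dimensional entries, and the expansion to `d_model` happens through the per-head projection, which is one plausible reading of "implicit value expansion"; the paper itself should be consulted for the actual formulation.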
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 70