Abstract: In a controlled factorial study of linear attention decay mechanisms, we find that the delta rule explains more variation than gating granularity, conditioning, or their interaction. We establish this by systematically evaluating all four quadrants of the {scalar, channel-wise}
× {data-dependent, data-independent} decay space, crossed with the delta rule, yielding 8 controlled variants tested on two datasets (TinyStories, WikiText-103) with 3 random seeds each at 18M, and on TinyStories with 3 seeds at 125M and 5 variants at 42M. All delta variants (ranks 1–4) beat all non-delta variants (ranks 5–8) at 18M (both datasets) and at 125M (TinyStories), and the gap is consistently larger than the within-group spread (6× at 125M). A granularity × delta interaction provides the second key finding: channel-wise decay hurts
without the delta rule but helps with it, suggesting that the delta rule is especially beneficial when multiple decay timescales are present. Within the top-4 delta variants, rankings are scale-dependent: data-independent StaticChannelDelta leads at 18M, while data-dependent
KDA overtakes it at 125M as the top-3 gap compresses to 0.009 nats. A synthetic recall probe reveals that gating granularity also creates a task-dependent tradeoff: scalar+delta solves exact retrieval while channel-wise+delta does not, indicating that channel-wise decay
trades retrieval precision for representational richness.
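
The factorial design described in the abstract can be sketched as a small enumeration. This is only an illustration of how the 2 × 2 decay space crossed with the delta rule yields 8 variants. The constant names, the dictionary keys, and the `enumerate_variants` helper are hypothetical and are not taken from the paper's code.

```python
from itertools import product

# Hypothetical labels for the three binary factors named in the abstract.
GRANULARITY = ("scalar", "channel-wise")
CONDITIONING = ("data-independent", "data-dependent")
DELTA_RULE = (False, True)


def enumerate_variants():
    """Return every cell of the granularity x conditioning x delta grid."""
    return [
        {"granularity": g, "conditioning": c, "delta": d}
        for g, c, d in product(GRANULARITY, CONDITIONING, DELTA_RULE)
    ]


variants = enumerate_variants()
for v in variants:
    # Illustrative label only; the paper's own variant names may differ.
    suffix = " + delta" if v["delta"] else ""
    print(f'{v["granularity"]}, {v["conditioning"]}{suffix}')
```

Under this labeling, the channel-wise, data-independent, delta cell would plausibly correspond to the variant the abstract calls StaticChannelDelta, though that mapping is an assumption.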
Submission Type: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Maria_Lomeli2
Submission Number: 8185