Keywords: benchmark design, sequence modeling, long-context evaluation, state-space models, attention
TL;DR: A controlled 1021-task lag-kernel benchmark shows that structured task neighborhoods reveal and predict architecture preference regions that pooled long-context scores hide.
Abstract: Long-context benchmarks often report pooled scores over heterogeneous tasks, making it difficult to identify which dependency structures a model actually recovers. We propose a controlled benchmark chart for lag-structured dependencies. Each task is specified by a normalized causal kernel and represented by a lossy but interpretable descriptor $\Phi(w)=(s,P,T_\eta,D)$, measuring support density, peakiness, tail mass, and dispersion. We instantiate 1021 tasks across eight anchor, bridge, and stress families, and compare same-order lightweight full attention, sliding-window attention, diagonal SSM, and Mamba-like selective SSM heads. The resulting chart reveals architecture-task structure hidden by pooled reporting: a pooled diagnostic summary nearly ties the two best models (0.659 vs. 0.657), while distinct families have sharply different winners. Local neighborhoods in $\Phi$ predict held-out winners with 66.7% accuracy, outperforming family-, region-, and single-model baselines; a targeted three-seed rerun preserves winners on 97.5% of mid/high-gap tasks. Finally, two query-dependent bridge probes, QueriedDecay and AddressedDecay, suggest interpretable preference migration beyond the fixed-kernel face rather than immediate collapse. These results argue for benchmark designs that report structured task neighborhoods rather than only aggregate scores.
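To make the descriptor $\Phi(w)=(s,P,T_\eta,D)$ and the neighborhood prediction more concrete, here is a minimal sketch in plain Python. The abstract does not give the exact formulas, so every definition below is an assumption chosen for illustration: support density as the fraction of nonzero lags, peakiness as the largest normalized weight, tail mass as the weight beyond a fraction `eta` of the kernel length, and dispersion as the normalized standard deviation of the lag. The function names `descriptor` and `predict_winner`, the default `eta=0.5`, and the k-nearest-neighbor vote are likewise hypothetical and may differ from the authors' procedure.

```python
import math
from collections import Counter


def normalize(w):
    """Scale a nonnegative kernel so its weights sum to 1."""
    total = sum(w)
    if total <= 0:
        raise ValueError("kernel must have positive total mass")
    return [x / total for x in w]


def descriptor(w, eta=0.5, eps=1e-12):
    """Return an illustrative (s, P, T_eta, D) for a kernel over lags 1..len(w).

    All four formulas are assumptions, not the paper's definitions.
    """
    w = normalize(w)
    n = len(w)
    lags = range(1, n + 1)
    # s: fraction of lags carrying non-negligible weight (assumed definition)
    s = sum(1 for x in w if x > eps) / n
    # P: largest single normalized weight (assumed definition)
    P = max(w)
    # T_eta: mass on lags strictly beyond eta * n (assumed definition)
    cutoff = eta * n
    T = sum(x for lag, x in zip(lags, w) if lag > cutoff)
    # D: standard deviation of the lag under w, divided by n (assumed definition)
    mean = sum(lag * x for lag, x in zip(lags, w))
    var = sum(x * (lag - mean) ** 2 for lag, x in zip(lags, w))
    D = math.sqrt(var) / n
    return (s, P, T, D)


def predict_winner(phi, labeled, k=3):
    """Majority vote over the k labeled descriptors nearest to phi (Euclidean)."""
    nearest = sorted(labeled, key=lambda item: math.dist(phi, item[0]))[:k]
    votes = Counter(name for _, name in nearest)
    return votes.most_common(1)[0][0]


# Toy usage: the labels below are made up and are not results from the paper.
spike = descriptor([0, 0, 1, 0])   # all mass on a single lag
flat = descriptor([1, 1, 1, 1])    # mass spread uniformly over lags
labeled = [(spike, "model_a"), (flat, "model_b")]
guess = predict_winner(descriptor([0, 0.1, 0.9, 0]), labeled, k=1)
```

Under these assumed definitions, a single-lag spike and a uniform kernel land far apart in the descriptor space, which is the kind of separation a neighborhood-based predictor would rely on.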
Paper Type: Long (8 pages)
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 48