Keywords: Associative Memory, Subquadratic Architectures, Test Time Scaling
Abstract: The demand for efficient inference has driven the development of subquadratic architectures as alternatives to the Transformer, though their capacity for complex, algorithmic reasoning remains a critical open question. To investigate the effect of architectural choice on downstream reasoning performance, we conduct a controlled study of reasoning scaling laws, training from scratch multiple hybrid-attention architectures of the same size (150M and 500M parameters) across three model classes (Mamba, Gated Linear Attention, Gated Delta Net) on a unified mathematical reasoning curriculum. Furthermore, we apply parallel test-time scaling via majority voting and observe a clear trend: increasing the number of attention layers improves reasoning performance. To investigate this trend, we analyze the models' responses using LLM-as-a-Judge and categorize reasoning errors into 8 distinct types inspired by taxonomies from math education, identifying in-context associative recall as the primary error mode in attention-free architectures. As models move toward fully linear designs without any attention layers, our findings establish a connection between the choice of architectural update rule and systematic failure modes. In particular, we find that hybrid models with Gated Delta Net can match and even exceed the performance of pure Transformers on mathematical reasoning. We present a principled empirical study that informs the design and evaluation of next-generation hybrid reasoning models.
Primary Area: foundation or frontier models, including LLMs
Submission Number: 16337