From Recall To Reasoning: Understanding the Role of Associative Memory in Hybrid Architectures

ICLR 2026 Conference Submission 16337 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Associative Memory, Subquadratic Architectures, Test Time Scaling
Abstract: The demand for efficient inference has driven the development of subquadratic architectures as alternatives to the Transformer, yet their capacity for complex, algorithmic reasoning remains a critical open question. To investigate the effect of architectural choice on downstream reasoning performance, we conduct a controlled study of reasoning scaling laws, training multiple hybrid-attention architectures of the same size (150M parameters) from scratch across three model classes (Mamba, Gated Linear Attention, Gated Delta Net) on a unified mathematical reasoning curriculum. Furthermore, we apply parallel test-time scaling via majority voting and uncover a clear trend: reasoning performance improves as the number of attention layers in the architecture increases. To explain this trend, we analyze the models' responses with an LLM-as-a-judge and categorize their errors into eight distinct types inspired by taxonomies from math education, identifying associative recall as the primary error mode in attention-free architectures. As models move toward fully linear designs without any attention layers, our findings establish a connection between the choice of architectural update rule and systematic failures on reasoning primitives such as state tracking and associative memory. We present a principled empirical study that informs the design and evaluation of next-generation hybrid reasoning models.
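The parallel test-time scaling described above is standard majority voting over independently sampled completions. As a rough illustration only (this is not code from the submission, and the function and variable names are hypothetical), a minimal sketch of the aggregation step might look like this:

```python
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    """Return the most frequent final answer among sampled completions.

    Ties are broken in favor of the answer that appears first in the list,
    which is the default behavior of Counter.most_common.
    """
    counts = Counter(a.strip() for a in answers)
    return counts.most_common(1)[0][0]

# Hypothetical usage: `answers` stands in for k independent samples of the
# same model on one problem; the values below are illustrative only.
answers = ["42", "42", "41", "42", "7"]
print(majority_vote(answers))  # -> "42"
```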
Primary Area: foundation or frontier models, including LLMs
Submission Number: 16337