Track: long paper (up to 5 pages)
Keywords: Test Time Scaling, Chain Of Thought, Reasoning, Mathematics, Associative Memory
TL;DR: Mamba-based models drastically underperform Transformers in mathematical reasoning due to deficiencies in associative memory.
Abstract: The emerging paradigm of scaling test-time compute, in which model performance is improved by scaling up chain-of-thought reasoning, is gaining significant traction in the deep learning community. While effective, these methods incur substantial computational costs at inference time due to the quadratic complexity of Transformers with respect to sequence length. Recently, subquadratic architectures such as Mamba have emerged that approach the performance of Transformers on language tasks while offering significant improvements in computational efficiency on long sequences. In this paper, we present the first empirical investigation into test-time compute scaling for subquadratic architectures. Our findings reveal that while these models do benefit from increased test-time compute, their gains are consistently lower than those observed in Transformers. We find that this limitation correlates with their reduced capacity for in-context associative memory, which hinders reasoning over extended sequences. These results shed light on the trade-off between computational efficiency and reasoning capability in modern architectures, providing a foundation for future research on designing models that are both scalable at test time and capable of long-chain reasoning.
Submission Number: 24
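To make concrete what "in-context associative memory" refers to in the abstract, below is a minimal, purely illustrative Python sketch of a synthetic key-value associative-recall probe. All names and parameters here (make_recall_example, recall_accuracy, the toy vocabulary) are hypothetical and are not the paper's actual evaluation setup; the `predict` callable stands in for any model's greedy completion, whether Transformer- or Mamba-based.

```python
import random

def make_recall_example(num_pairs=8, vocab=list("abcdefghijklmnop"), seed=0):
    """Build a key->value context followed by a query over one of the keys."""
    rng = random.Random(seed)
    keys = rng.sample(vocab, num_pairs)              # unique keys
    values = [rng.choice(vocab) for _ in keys]        # arbitrary values
    context = " ".join(f"{k}->{v}" for k, v in zip(keys, values))
    query_key, answer = rng.choice(list(zip(keys, values)))
    prompt = f"{context} | {query_key}->"
    return prompt, answer

def recall_accuracy(predict, num_examples=100):
    """Score a model's completion against the ground-truth value.

    `predict` is any callable mapping a prompt string to a predicted value,
    e.g. a wrapper around a model's greedy decode of the next token.
    """
    correct = 0
    for i in range(num_examples):
        prompt, answer = make_recall_example(seed=i)
        correct += int(predict(prompt) == answer)
    return correct / num_examples

# A trivial oracle that parses the context directly scores 1.0,
# giving an upper bound to compare model completions against.
def oracle(prompt):
    context, query = prompt.rsplit(" | ", 1)
    table = dict(pair.split("->") for pair in context.split())
    return table[query.rstrip("->")]

print(recall_accuracy(oracle))  # 1.0
```

A model with strong in-context associative memory should approach the oracle's score as the number of key-value pairs grows; the abstract's claim is that subquadratic models degrade on this kind of retrieval sooner than Transformers, which in turn limits their gains from longer chains of thought.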