Beyond Memorization: Extending Reasoning Depth with Recurrence, Memory, and Test-Time Compute Scaling
Keywords: sequence models, reasoning, transformers, state-space models, recurrent models, recurrent transformers, adaptive computation time, computational expressivity, computational complexity, chain-of-thought, reinforcement learning
TL;DR: We investigate how training methods and model architecture influence multi-step reasoning performance in a controlled cellular-automata framework. We show that reasoning depth can be significantly extended with recurrence, memory, and test-time compute.
Abstract: Reasoning is a core capability of large language models, yet how multi-step reasoning is learned and executed remains unclear. We study this in a controlled one-dimensional cellular-automata (1dCA) framework that excludes memorization by using disjoint train and test rule sets. Models are trained on short state sequences and must _infer_ the hidden local rule, then _chain_ it for multiple future steps. We find that most neural architectures learn the rule and achieve high next-step accuracy, but performance drops sharply as the required number of steps increases. Increasing model depth is crucial, and extending _effective_ depth via recurrence, memory, or test-time compute improves results further, though the gains remain bounded. Complementing these controlled experiments, a natural-language proxy game shows that contemporary LLMs largely fail in the complex setting. Together, these results separate genuine rule induction from memorization, quantify how difficulty scales with reasoning depth, and highlight the joint roles of architecture and training and inference procedures.
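To make the setup concrete, the following is a minimal sketch (not the authors' code) of how a 1dCA task of this kind could be generated: an elementary (radius-1, binary) rule is kept hidden, the model observes a short prefix of states, and the targets are the states obtained by chaining the rule for several further steps. The rule split, state width, prefix length, and prediction horizon below are hypothetical illustrative choices.

```python
# Minimal sketch of a 1dCA reasoning task generator (illustrative, not the paper's code).
# Assumes elementary (radius-1, binary) CA rules, circular boundaries, and
# hypothetical values for state width, prefix length, and prediction horizon.
import random

def step(state, rule):
    """Apply one update of an elementary CA rule (0-255) with wrap-around boundaries."""
    n = len(state)
    out = []
    for i in range(n):
        left, centre, right = state[(i - 1) % n], state[i], state[(i + 1) % n]
        idx = (left << 2) | (centre << 1) | right  # 3-bit neighbourhood index
        out.append((rule >> idx) & 1)
    return out

def make_example(rule, width=32, prefix_len=4, horizon=8, rng=random):
    """One task instance: a short observed prefix, plus the future states the model
    must predict by inferring the hidden rule and chaining it `horizon` times."""
    state = [rng.randint(0, 1) for _ in range(width)]
    prefix = [state]
    for _ in range(prefix_len - 1):
        state = step(state, rule)
        prefix.append(state)
    targets = []
    for _ in range(horizon):
        state = step(state, rule)
        targets.append(state)
    return prefix, targets

# Disjoint train/test rule sets exclude memorization of specific rules.
rules = list(range(256))
random.Random(0).shuffle(rules)
train_rules, test_rules = rules[:200], rules[200:]

prefix, targets = make_example(rule=test_rules[0])
print(len(prefix), "observed states;", len(targets), "future states to predict")
```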
Primary Area: foundation or frontier models, including LLMs
Submission Number: 9596