Beyond Memorization: Extending Reasoning Depth with Recurrence, Memory, and Test-Time Compute Scaling
Keywords: sequence models, reasoning, transformers, state-space models, recurrent models, recurrent transformers, adaptive computation time, computational expressivity, computational complexity, chain-of-thought, reinforcement learning
TL;DR: We investigate how training methods and model architecture influence multi-step reasoning performance in a controlled cellular-automata framework. We show that reasoning depth can be significantly extended with recurrence, memory, and test-time compute.
Abstract: Reasoning is a core capability of large language models, yet how multi-step reasoning is learned and executed remains unclear. We study this in a controlled one-dimensional cellular-automata (1dCA) framework that excludes memorization by using disjoint train and test rule sets. Models are trained on short state sequences and must _infer_ the hidden local rule, then _chain_ it for multiple future steps. We find that most neural architectures learn the rule and achieve high next-step accuracy, but performance drops sharply as the required number of steps increases. Increasing model depth is crucial, and extending _effective_ depth via recurrence, memory, or test-time compute improves results further, though the gains remain bounded. Complementing these controlled experiments, a natural-language proxy game shows that contemporary LLMs largely fail in the complex setting. Together, these results separate genuine rule induction from memorization, quantify how difficulty scales with reasoning depth, and highlight the joint roles of architecture and training and inference procedures.
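To make the setup concrete, the following is a minimal sketch (not the authors' code) of how a 1dCA task of this kind could be generated: an elementary (radius-1, binary) rule is kept hidden, the model observes a short prefix of states, and the targets are the states obtained by chaining the rule for several further steps. The rule split, state width, prefix length, and prediction horizon below are hypothetical illustrative choices.

```python
# Minimal sketch of a 1dCA reasoning task generator (illustrative, not the paper's code).
# Assumes elementary (radius-1, binary) CA rules, circular boundaries, and
# hypothetical values for state width, prefix length, and prediction horizon.
import random

def step(state, rule):
    """Apply one update of an elementary CA rule (0-255) with wrap-around boundaries."""
    n = len(state)
    out = []
    for i in range(n):
        left, centre, right = state[(i - 1) % n], state[i], state[(i + 1) % n]
        idx = (left << 2) | (centre << 1) | right  # 3-bit neighbourhood index
        out.append((rule >> idx) & 1)
    return out

def make_example(rule, width=32, prefix_len=4, horizon=8, rng=random):
    """One task instance: a short observed prefix, plus the future states the model
    must predict by inferring the hidden rule and chaining it `horizon` times."""
    state = [rng.randint(0, 1) for _ in range(width)]
    prefix = [state]
    for _ in range(prefix_len - 1):
        state = step(state, rule)
        prefix.append(state)
    targets = []
    for _ in range(horizon):
        state = step(state, rule)
        targets.append(state)
    return prefix, targets

# Disjoint train/test rule sets exclude memorization of specific rules.
rules = list(range(256))
random.Random(0).shuffle(rules)
train_rules, test_rules = rules[:200], rules[200:]

prefix, targets = make_example(rule=test_rules[0])
print(len(prefix), "observed states;", len(targets), "future states to predict")
```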
Primary Area: foundation or frontier models, including LLMs
Submission Number: 9596