Track: Main paper track (up to 5 pages excluding references and appendix)
Keywords: chain-of-thought, large language models, test-time scaling, space efficiency
Abstract: While recent works (e.g., o1, DeepSeek R1) have shown the great promise of using long Chain-of-Thought (CoT) at test time to improve the reasoning capabilities of language models, scaling it up is challenging due to inefficient memory usage: intermediate computations accumulate indefinitely in the context even when they are no longer needed for generating future thoughts. We propose PENCIL to address this limitation, which incorporates a reduction mechanism into the autoregressive generation process, allowing the model to recursively clean up intermediate thoughts in ways learned from training. With the reduction mechanism, the maximal context length during generation can decrease from the time complexity of solving the problem, which is often exponential for inherently hard tasks, to the actual space required, which is often polynomial. By using space efficiently, PENCIL can generate longer thoughts with limited memory and thus solve larger-scale problems given more inference time. For example, we show that PENCIL achieves near-perfect accuracy on the challenging Einstein's puzzle using a small 25M-parameter transformer with a 2048-token context length. Theoretically, we show that PENCIL can perform universal space-efficient computation by simulating Turing machines with optimal time and space complexity.
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 74