InftyThink+: Effective and Efficient Infinite-Horizon Reasoning via Reinforcement Learning

Yuchen Yan; Liang Jiang; Jin Jiang; Shuaicheng Li; Zujie Wen; Zhiqiang Zhang; Jun Zhou; Jian Shao; Yueting Zhuang; Yongliang Shen

Published: 30 Apr 2026, Last Modified: 24 Jun 2026, Venue: ICML 2026 regular, License: CC BY 4.0

Abstract: Large reasoning models achieve strong performance by scaling inference-time chain-of-thought, but this paradigm suffers from quadratic cost, context length limits, and degraded reasoning due to lost-in-the-middle effects. Iterative reasoning mitigates these issues by periodically summarizing intermediate thoughts, yet existing methods rely on supervised learning or fixed heuristics and fail to optimize when to summarize, what to preserve, and how to resume reasoning. We propose InftyThink+, an end-to-end reinforcement learning framework that optimizes the entire iterative reasoning trajectory, building on model-controlled iteration boundaries and explicit summarization. InftyThink+ adopts a two-stage training scheme with supervised cold-start followed by trajectory-level reinforcement learning, enabling the model to learn strategic summarization and continuation decisions. Experiments on DeepSeek-R1-Distill-Qwen-1.5B show that InftyThink+ improves accuracy by 21% on AIME24 and outperforms conventional long chain-of-thought reinforcement learning by a clear margin, while also generalizing better to out-of-distribution benchmarks. Moreover, InftyThink+ significantly reduces inference latency and accelerates reinforcement learning training, demonstrating improved reasoning efficiency alongside stronger performance.
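The quadratic-cost argument in the abstract can be made concrete with a back-of-the-envelope estimate. This is a sketch under simplifying assumptions that are not taken from the paper: attention cost is counted as the number of (query, key) pairs during decoding, prefill cost is ignored, the total number of generated reasoning tokens n is the same in both settings, the iterative run uses k rounds of m = n/k tokens each, and every round starts from c carried-over prompt tokens (the question plus the current summary). The symbols n, k, m, and c are introduced here for illustration only.

```latex
% Single long chain of thought: token t attends to t positions.
C_{\text{single}} \;\propto\; \sum_{t=1}^{n} t \;=\; \frac{n(n+1)}{2} \;\approx\; \frac{n^{2}}{2}

% Iterative reasoning: k rounds, each generating m = n/k tokens on top of c carried-over tokens.
C_{\text{iter}} \;\propto\; k \sum_{t=1}^{m} (c + t)
  \;=\; k\left(mc + \frac{m(m+1)}{2}\right)
  \;\approx\; nc + \frac{n^{2}}{2k}

% Ratio of the two estimates.
\frac{C_{\text{iter}}}{C_{\text{single}}} \;\approx\; \frac{2c}{n} + \frac{1}{k}
```

Under these assumptions, when the carried-over context c is small relative to n, the attention cost shrinks roughly in proportion to the number of rounds k. In practice the total token count may differ between the two settings, so this should be read as an order-of-magnitude illustration rather than a prediction of the reported speedups.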

Lay Summary: Large language models can solve difficult problems by writing long step-by-step reasoning, but this often becomes slow, expensive, and unreliable when the reasoning gets too long. Important earlier information may be forgotten, and the model may run out of usable context before reaching an answer. We propose InftyThink+, a method that teaches language models to reason in multiple shorter rounds. Instead of producing one very long response, the model periodically pauses, summarizes the most important intermediate conclusions, and continues reasoning from this compact summary. Through reinforcement learning, the model learns not only how to write these summaries, but also when to summarize, what information to keep, and how to continue effectively afterward. Our experiments show that this approach improves reasoning accuracy on challenging math and science benchmarks while also reducing inference time. This makes advanced reasoning models more practical, cheaper to run, and potentially more energy-efficient. InftyThink+ suggests that better reasoning is not just about thinking longer, but about learning how to manage and reuse intermediate thoughts more strategically.
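The multi-round procedure described above can be sketched as a simple inference loop. This is a minimal illustration, not the authors' implementation: the `<summary>` and `<answer>` markers, the prompt wording, the `generate` callable, and the `max_rounds` limit are all hypothetical names chosen for this example. The point it shows is that each round sees only the question and the latest summary, never the full earlier reasoning.

```python
from typing import Callable, Optional, Tuple

# Hypothetical markers; the actual output format used by InftyThink+ is not specified here.
SUMMARY_OPEN, SUMMARY_CLOSE = "<summary>", "</summary>"
ANSWER_OPEN, ANSWER_CLOSE = "<answer>", "</answer>"


def extract(text: str, open_tag: str, close_tag: str) -> Optional[str]:
    """Return the text between the first open_tag and the next close_tag, or None."""
    start = text.find(open_tag)
    if start == -1:
        return None
    start += len(open_tag)
    end = text.find(close_tag, start)
    if end == -1:
        return None
    return text[start:end].strip()


def iterative_reason(
    question: str,
    generate: Callable[[str], str],  # assumed: maps a prompt to one round of model output
    max_rounds: int = 8,
) -> Tuple[Optional[str], int]:
    """Run up to max_rounds reasoning rounds; return (answer or None, rounds used)."""
    summary = ""
    for round_index in range(1, max_rounds + 1):
        # Each round is conditioned on the question plus the latest summary only.
        if summary:
            prompt = f"{question}\n\nSummary of progress so far:\n{summary}"
        else:
            prompt = question
        output = generate(prompt)

        # The model decides where the round ends: it either answers or summarizes.
        answer = extract(output, ANSWER_OPEN, ANSWER_CLOSE)
        if answer is not None:
            return answer, round_index

        new_summary = extract(output, SUMMARY_OPEN, SUMMARY_CLOSE)
        if new_summary is None:
            return None, round_index  # neither an answer nor a summary was produced
        summary = new_summary
    return None, max_rounds
```

In this sketch the context length per round is bounded by the question, one summary, and one round of generation, which is what lets the overall reasoning run longer than a single context window would allow. Learning when to emit a summary and what to put in it is the part the paper addresses with reinforcement learning; the loop above only shows the control flow around that learned behavior.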

Link To Code: https://zju-real.github.io/InftyThink-Plus/

Primary Area: Deep Learning->Large Language Models

Keywords: LLM Reasoning, Efficient Reasoning, Reinforcement Learning

Submission Number: 1895
