Recursive Self-Aggregation Unlocks Deep Thinking in Large Language Models

Siddarth Venkatraman; Vineet Jain; Sarthak Mittal; Vedant Shah; Johan Obando-Ceron; Yoshua Bengio; Brian R. Bartoldson; Bhavya Kailkhura; Guillaume Lajoie; Glen Berseth; Nikolay Malkin; Moksh Jain

17 Sept 2025 (modified: 11 Feb 2026). Submitted to ICLR 2026. License: CC BY 4.0

Keywords: large language models, reasoning, reinforcement learning, test-time scaling

TL;DR: We introduce Recursive Self-Aggregation, a test-time scaling method inspired by genetic algorithms, where an LLM iteratively aggregates its own reasoning traces to refine and improve solutions.

Abstract: Test-time scaling methods improve the capabilities of large language models (LLMs) by increasing the amount of compute used during inference to make a prediction. Inference-time compute can be scaled *in parallel* by choosing among multiple independent solutions or *sequentially* through self-refinement. We propose Recursive Self-Aggregation (RSA), a test-time scaling method inspired by evolutionary methods that combines the benefits of both parallel and sequential scaling. Each step of RSA refines a population of candidate reasoning chains through aggregation of subsets to yield a population of improved solutions, which are then used as the candidate pool for the next iteration. RSA exploits the rich information embedded in the reasoning chains -- not just the final answers -- and enables bootstrapping from partially correct intermediate steps within different chains of thought. Empirically, RSA delivers substantial performance gains with increasing compute budgets across diverse tasks, model families and sizes. Notably, RSA enables Qwen3-4B-Instruct-2507 to achieve competitive performance with larger reasoning models, including DeepSeek-R1 and o3-mini (high), while outperforming purely parallel and sequential scaling strategies across AIME-25, HMMT-25, Reasoning Gym, LiveCodeBench-v6, and SuperGPQA. We further demonstrate that training the model to combine solutions via a novel aggregation-aware reinforcement learning approach yields significant performance gains.
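The loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `recursive_self_aggregation`, its parameters (`population_size`, `subset_size`, `steps`, `seed`), and the choice to draw subsets uniformly without replacement are all assumptions made for illustration. The `generate` and `aggregate` callables stand in for LLM calls, and the toy versions below only merge comma-separated "steps" so that the control flow can be run without a model.

```python
import random
from typing import Callable, Iterable, List


def recursive_self_aggregation(
    generate: Callable[[str], str],
    aggregate: Callable[[str, List[str]], str],
    prompt: str,
    population_size: int = 4,
    subset_size: int = 2,
    steps: int = 3,
    seed: int = 0,
) -> List[str]:
    """Hypothetical sketch of an RSA-style loop (names and defaults are assumptions)."""
    if not 1 <= subset_size <= population_size:
        raise ValueError("subset_size must be between 1 and population_size")
    rng = random.Random(seed)

    # Parallel part: draw an initial population of independent candidates.
    population = [generate(prompt) for _ in range(population_size)]

    # Sequential part: each step builds a new population by aggregating
    # randomly chosen subsets of the current one (assumed: sampling without
    # replacement within a subset).
    for _ in range(steps):
        population = [
            aggregate(prompt, rng.sample(population, subset_size))
            for _ in range(population_size)
        ]
    return population


def make_toy_generate(drafts: Iterable[str]) -> Callable[[str], str]:
    """Toy stand-in for an LLM sampler: returns the given drafts in order."""
    iterator = iter(drafts)
    return lambda prompt: next(iterator)


def toy_aggregate(prompt: str, candidates: List[str]) -> str:
    """Toy stand-in for an LLM aggregator: union of comma-separated steps."""
    merged = set()
    for candidate in candidates:
        merged.update(candidate.split(","))
    return ",".join(sorted(merged))


if __name__ == "__main__":
    final = recursive_self_aggregation(
        make_toy_generate(["a", "b", "c", "d"]),
        toy_aggregate,
        prompt="toy question",
    )
    print(final)
```

In a real setting, `generate` would sample a full reasoning chain from the model and `aggregate` would prompt the same model with the question plus the selected chains. How the final answer is chosen from the last population is not stated in the abstract and is left out of the sketch.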

Primary Area: foundation or frontier models, including LLMs

Submission Number: 9263
