AlphaApollo: A System for Deep Agentic Reasoning

Zhanke Zhou; Chentao Cao; Xiao Feng; Xuan Li; Zongze Li; Xiangyu Lu; Jiangchao Yao; Weikai Huang; Tian Cheng; Jianghangfan Zhang; Tangyu Jiang; Linrui Xu; Yiming Zheng; Brando Miranda; Tongliang Liu; Sanmi Koyejo; Masashi Sugiyama; Bo Han

Published: 02 Mar 2026, Last Modified: 10 Apr 2026, Venue: LLA 2026 Poster, License: CC BY 4.0

Keywords: reasoning, agent, tool-use, self-evolving, reinforcement learning

Abstract: We present AlphaApollo, a self-evolving agentic reasoning system that targets two bottlenecks in foundation-model reasoning: (1) limited capacity for long-horizon, multi-step problem solving and (2) unreliable test-time refinement without trustworthy verification. AlphaApollo orchestrates models and tools via three components: (i) multi-turn agentic reasoning, which formalizes model-environment interaction with structured tool calls and responses; (ii) multi-turn agentic learning, which applies turn-level reinforcement learning to optimize tool-use decisions while decoupling actions from tool responses for stable training; and (iii) multi-round agentic evolution, which refines solutions through a propose-judge-update loop with tool-assisted verification and long-horizon memory. Across seven math reasoning benchmarks and multiple model scales, AlphaApollo shows reliable tool use (>85% tool-call success), substantial gains from multi-turn RL (Avg@32: Qwen2.5-1.5B-Instruct 1.07% → 9.64%, Qwen2.5-7B-Instruct 8.77% → 20.35%), and further improvements from evolution (e.g., Qwen2.5-3B-Instruct 5.27% → 7.70%, Qwen2.5-14B-Instruct 16.53% → 21.08%).
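
The propose-judge-update loop named in component (iii) can be illustrated with a minimal sketch. The abstract does not specify an implementation, so everything below is an assumption made for illustration: the names `Memory`, `evolve`, `toy_propose`, and `toy_judge` are hypothetical, the proposer is a trivial enumerator standing in for a model, and the judge is a direct arithmetic check standing in for tool-assisted verification.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Memory:
    """Hypothetical long-horizon memory: every (candidate, verdict) pair so far."""
    records: List[Tuple[int, bool]] = field(default_factory=list)


def evolve(
    propose: Callable[[Memory], int],
    judge: Callable[[int], bool],
    max_rounds: int = 10,
) -> Tuple[Optional[int], Memory]:
    """Run propose -> judge -> update rounds until a candidate is verified."""
    memory = Memory()
    for _ in range(max_rounds):
        candidate = propose(memory)                   # propose: conditioned on memory
        verdict = judge(candidate)                    # judge: stand-in for a tool check
        memory.records.append((candidate, verdict))   # update: record the outcome
        if verdict:
            return candidate, memory
    return None, memory                               # round budget exhausted


def toy_propose(memory: Memory) -> int:
    """Assumed toy proposer: smallest positive integer not tried yet."""
    tried = {c for c, _ in memory.records}
    return next(n for n in range(1, 100) if n not in tried)


def toy_judge(candidate: int) -> bool:
    """Assumed toy verifier: checks the candidate against n * n == 49."""
    return candidate * candidate == 49


answer, memory = evolve(toy_propose, toy_judge)
print(answer, len(memory.records))  # prints "7 7"
```

In this toy setting the memory only prevents repeated proposals; in a real system it would presumably also carry judge feedback that shapes later proposals, which the sketch does not attempt to model.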

Submission Number: 235