AlphaApollo: A System for Deep Agentic Reasoning
Keywords: reasoning, agent, tool-use, self-evolving, reinforcement learning
Abstract: We present AlphaApollo, a self-evolving agentic reasoning system that targets two bottlenecks in foundation-model reasoning: (1) limited capacity for long-horizon, multi-step problem solving and (2) unreliable test-time refinement without trustworthy verification. AlphaApollo orchestrates models and tools via three components: (i) multi-turn agentic reasoning, which formalizes model-environment interaction with structured tool calls and responses; (ii) multi-turn agentic learning, which applies turn-level reinforcement learning to optimize tool-use decisions while decoupling actions from tool responses for stable training; and (iii) multi-round agentic evolution, which refines solutions through a propose-judge-update loop with tool-assisted verification and long-horizon memory. Across seven math reasoning benchmarks and multiple model scales, AlphaApollo achieves reliable tool use (>85% tool-call success), substantial gains from multi-turn RL (Avg@32: Qwen2.5-1.5B-Instruct 1.07% → 9.64%, Qwen2.5-7B-Instruct 8.77% → 20.35%), and further improvements from evolution (e.g., Qwen2.5-3B-Instruct 5.27% → 7.70%, Qwen2.5-14B-Instruct 16.53% → 21.08%).
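The abstract reports results as Avg@32 without defining the metric. It is commonly read as accuracy averaged over 32 sampled responses per problem and then over all problems. The sketch below computes the metric under that assumption; the function name `avg_at_k` and the example outcomes are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def avg_at_k(correct: Sequence[Sequence[bool]]) -> float:
    """Return Avg@k as a percentage.

    `correct[i][j]` is True when the j-th sampled response to problem i
    is judged correct. Assumed definition: average the per-problem
    accuracy over its k samples, then average across problems.
    """
    if not correct:
        raise ValueError("need at least one problem")
    per_problem = [sum(samples) / len(samples) for samples in correct]
    return 100.0 * sum(per_problem) / len(per_problem)


# Two hypothetical problems with k = 4 samples each (made-up outcomes,
# used only to illustrate the arithmetic; the paper uses k = 32).
outcomes = [
    [True, False, False, False],  # 1 of 4 correct
    [True, True, False, False],   # 2 of 4 correct
]
score = avg_at_k(outcomes)
```

Under this reading, a reported change such as 1.07% → 9.64% describes the expected accuracy of a single sampled response, not the chance that at least one of the 32 samples is correct.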
Submission Number: 235