Xolver: Generalist Reasoning and Problem Solving through Federated Multi-Agent Dynamics and Holistic Experience Learning

ICLR 2026 Conference Submission 12956 Authors

18 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: Reasoning, Code Generation, Problem Solving, Math, Code, Agent, Multi-Agent, SWE, RAG, Agentic, Experience Learning
TL;DR: We develop a multi-agent in-context learning approach to boost the agentic and reasoning capabilities of foundation models.
Abstract: Despite rapid advances in complex reasoning, large language models (LLMs) largely operate in isolation, treating each problem as a fresh attempt without retaining or reusing prior experience. In contrast, expert problem solvers—such as Olympiad or programming contest teams—leverage a rich tapestry of experience: mentorship from coaches, intuition from past problems, mastery of tools and libraries, peer strategies, and continuous refinement through trial and error—even drawing insights from related problems under competition conditions. Inspired by this, we introduce Xolver—a training-free, generalist reasoning and problem-solving framework that equips a black-box LLM with a persistent, evolving memory of holistic experience. Xolver combines two key innovations: (i) a holistic experience-learning paradigm that unifies external and self-retrieval, tool use, collaborative agent interaction, agent-driven evaluation, and iterative reasoning refinement; and (ii) a dynamic multi-agent collaboration schema that departs from orchestration engineering, instead employing a federated learning strategy in which agents independently solve problems and aggregate their solutions. Extensive evaluations across reasoning, agentic, and coding benchmarks show that Xolver consistently outperforms specialized reasoning agents (e.g., OctoTools, Search-o1, AWorld, OpenHands, OAgents, Agent S2.5). Even with lightweight backbones (e.g., QwQ-32B), it frequently surpasses state-of-the-art proprietary models (Qwen3-235B, Gemini 2.5 Pro, o3, Deep Research, o4-mini-high). With stronger backbones (e.g., o3-mini-high), Xolver achieves new state-of-the-art scores: 94.4 on AIME'24, 93.7 on AIME'25, 91.6 on LiveCodeBench, 90.1 on GAIA, 71.7 on BrowseComp, 74.4 on OSWorld, 84.9 on SWE-bench Verified (bash only), 57.3 on HLE, 94.6 on GPQA Diamond, 84.4 on 2WIKI, and 82.3 on Bamboogle—highlighting holistic and federated experience learning as a crucial step toward dynamic, generalist agents capable of expert-level reasoning. We will open-source all code and data.
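
To make the federated solve-and-aggregate loop named in the abstract concrete, here is a minimal, hypothetical Python sketch. The function name `federated_solve`, the majority-vote aggregation, and the list-based experience memory are our own illustrative assumptions, not the paper's implementation: each agent answers independently while reading a shared experience memory, the independent answers are aggregated, and the episode is written back so later problems can reuse it.

```python
import collections
from typing import Callable, List

# Hypothetical sketch: each "agent" is a callable that maps a problem string
# (plus a shared experience memory) to a candidate answer.
def federated_solve(problem: str,
                    agents: List[Callable[[str, list], str]],
                    memory: list) -> str:
    # Step 1: agents solve independently; no orchestration between them.
    candidates = [agent(problem, memory) for agent in agents]
    # Step 2: aggregate the independent solutions (majority vote here,
    # standing in for the paper's agent-driven evaluation).
    answer, _ = collections.Counter(candidates).most_common(1)[0]
    # Step 3: store the episode so future problems can reuse the experience.
    memory.append({"problem": problem, "candidates": candidates, "answer": answer})
    return answer

# Toy usage with stand-in agents instead of LLM calls.
if __name__ == "__main__":
    agents = [lambda p, m: "42", lambda p, m: "42", lambda p, m: "41"]
    memory = []
    print(federated_solve("What is 6 * 7?", agents, memory))  # -> "42"
```

A real instantiation would replace the stand-in agents with LLM calls and the majority vote with agent-driven evaluation; the sketch only illustrates the independent-solve-then-aggregate structure and the persistent memory described in the abstract.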
Primary Area: generative models
Submission Number: 12956