COMPASS: Enhancing Agent Long-Horizon Reasoning with Evolving Context

19 Sept 2025 (modified: 12 Feb 2026) · ICLR 2026 Conference Desk Rejected Submission · CC BY 4.0
Keywords: Context Engineering, Multi-Agent systems, Long Horizon Tasks, LLM Agents
TL;DR: We introduce COMPASS, a dual-loop multi-agent framework that uses context management and strategic oversight to make LLM agents reliable on long-horizon tasks.
Abstract: Long-horizon tasks requiring many rounds of reasoning and tool use remain challenging for LLM agents: small mistakes compound across steps, and even state-of-the-art models can produce unexpected or hallucinated tool outputs. We identify ineffective context management as the core bottleneck: as execution unfolds, unstructured histories cause agents to overlook critical evidence or become overwhelmed by irrelevant information. To address this, we introduce COMPASS (Context-Organized Multi-Agent Planning and Strategy System), a lightweight hierarchical framework that separates tactical execution, strategic oversight, and context management into three specialized components: (1) a Main Agent that executes reasoning and tool calls, (2) a Meta-Thinker that monitors execution and issues strategic signals, and (3) a Context Manager that maintains concise, strategically relevant summaries. This design preserves single-agent fluidity while enabling adaptive context organization throughout execution. Across three challenging benchmarks—GAIA, BrowseComp, and Humanity's Last Exam—COMPASS improves accuracy by over 10% compared to both single- and multi-agent baselines. Ablation studies confirm that the designed components are crucial for long-horizon reasoning; test-time scaling extensions boost performance by up to 20%, matching established DeepResearch agents; and a post-training optimization pipeline improves token efficiency by 25%.
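The dual-loop separation of concerns described in the abstract could be sketched roughly as follows. This is a minimal illustrative sketch, not the authors' implementation: all names (`ContextManager`, `MetaThinker`, `MainAgent`, `run_compass_like`) are hypothetical stand-ins, and simple heuristics and string stubs take the place of real LLM and tool calls.

```python
class ContextManager:
    """Keeps a concise rolling summary instead of the full raw history (hypothetical)."""
    def __init__(self, max_items: int = 3):
        self.summary: list[str] = []
        self.max_items = max_items

    def update(self, event: str) -> None:
        # Retain only the most recent strategically relevant items.
        self.summary = (self.summary + [event])[-self.max_items:]


class MetaThinker:
    """Monitors execution via the managed context and issues strategic signals (hypothetical)."""
    def signal(self, summary: list[str]) -> str:
        # Toy heuristic: ask for a replan if repeated failures appear in the summary.
        failures = [e for e in summary if "FAIL" in e]
        return "replan" if len(failures) >= 2 else "continue"


class MainAgent:
    """Executes one tactical step; a stub standing in for LLM reasoning + tool calls."""
    def step(self, task: str, plan: str, t: int) -> str:
        return f"step {t}: executed {plan} for {task!r}"


def run_compass_like(task: str, steps: int = 5) -> list[tuple[str, str]]:
    """Inner loop: tactical execution. Outer loop: oversight over summarized context."""
    ctx, meta, agent = ContextManager(), MetaThinker(), MainAgent()
    plan, trace = "plan-A", []
    for t in range(steps):
        result = agent.step(task, plan, t)   # tactical execution
        ctx.update(result)                   # context management
        if meta.signal(ctx.summary) == "replan":
            plan = "plan-B"                  # strategic oversight intervenes
        trace.append((plan, result))
    return trace
```

The point of the structure is that the Meta-Thinker never sees the raw history, only the Context Manager's bounded summary, which is the abstract's claimed remedy for agents drowning in irrelevant information.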
Primary Area: foundation or frontier models, including LLMs
Submission Number: 15727