COMPASS: A Multi-Turn Benchmark for Tool-Mediated Planning & Preference Optimization

17 Sept 2025 (modified: 11 Feb 2026) | Submitted to ICLR 2026 | CC BY 4.0
Keywords: multi-turn tool calling, preference optimization, reasoning and planning
TL;DR: COMPASS is a benchmark that tests whether LLM agents can intelligently optimize user preferences in realistic, tool-based travel planning tasks.
Abstract: Real-world tasks such as travel planning require LLM agents to satisfy hard constraints (dates, budget) while optimizing the user's utility preferences (cheapest hotel, most convenient flights). We formalize this as *constrained preference optimization*, in which agents strategically use tools to gather information and compare options in order to optimize the user's preferences. We introduce **COMPASS** (**C**onstrained **O**ptimization through **M**ulti-turn **P**lanning **a**nd **S**trategic **S**olutions), a benchmark that evaluates agents on realistic travel planning. We build a travel database covering transportation, accommodation, and ticketing for 20 U.S. National Parks, plus a tool ecosystem mirroring commercial booking platforms. Evaluating state-of-the-art models reveals a significant **acceptable–optimal gap**: models achieve 85-95% constraint satisfaction but only 60-70% preference optimization, settling for feasible rather than optimal solutions. Performance degrades sharply on multi-service coordination tasks. Our tool-use analysis shows that task success correlates strongly with information gathering: insufficient exploration is the primary bottleneck, though future models should prioritize efficient over exhaustive search. *COMPASS* provides a rigorous benchmark for diagnosing core challenges in constrained preference optimization and guiding the development of user-aligned agents.
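As a rough illustration of the acceptable–optimal distinction described in the abstract, the sketch below filters options by hard constraints and then checks whether a chosen option is also the best one under a "cheapest hotel" preference. All names here (`Hotel`, `feasible`, `optimal`, `score`) and the sample data are hypothetical and chosen for illustration only; they are not the benchmark's actual schema, tools, or scoring code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hotel:
    # Hypothetical record; the real COMPASS database schema is not shown in the source.
    name: str
    price_per_night: float
    available: bool


def feasible(options, nights, budget):
    """Keep only options that satisfy the hard constraints (availability, total budget)."""
    return [o for o in options if o.available and o.price_per_night * nights <= budget]


def optimal(options, nights, budget):
    """Among feasible options, return the cheapest one (the assumed preference), or None."""
    candidates = feasible(options, nights, budget)
    return min(candidates, key=lambda o: o.price_per_night) if candidates else None


def score(choice, options, nights, budget):
    """Return (acceptable, is_optimal) for an agent's choice.

    acceptable: the choice satisfies every hard constraint.
    is_optimal: the choice is acceptable and ties the best feasible price.
    """
    acceptable = choice in feasible(options, nights, budget)
    best = optimal(options, nights, budget)
    is_optimal = (
        acceptable
        and best is not None
        and choice.price_per_night == best.price_per_night
    )
    return acceptable, is_optimal


# Illustrative data only.
hotels = [
    Hotel("Lodge A", 180.0, True),
    Hotel("Lodge B", 140.0, True),
    Hotel("Lodge C", 120.0, False),  # cheapest, but unavailable
    Hotel("Lodge D", 260.0, True),   # exceeds the budget over 3 nights
]

results = {h.name: score(h, hotels, nights=3, budget=600.0) for h in hotels}
```

In this toy setup, choosing "Lodge A" is acceptable but not optimal, which is the kind of outcome the acceptable–optimal gap refers to: the agent stops at a feasible option without comparing enough alternatives to find "Lodge B".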
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 8401