Keywords: travel planning, reasoning, reinforcement learning
Abstract: Travel planning is a valuable yet complex task that poses significant challenges even for advanced large language models (LLMs). Existing benchmarks, however, largely equate planning ability with solving rigid constraint-satisfaction problems, and solvers that excel at such synthetic logic puzzles often fail to handle the ambiguity of real-world user intents. To address this gap, we present TripScore, a behavior-grounded benchmark and evaluation framework designed to align agent development with real-world utility. We release a large-scale dataset of 4,870 queries, including 219 real-world, free-form requests that test generalization to authentic user intent. We also propose a unified evaluation reward that fuses feasibility and quality into a single fine-grained scalar. Our evaluator achieves moderate agreement with travel-expert annotations (60.75%) and outperforms multiple LLM-as-judge baselines. Leveraging TripScore, we conduct extensive experiments across diverse paradigms, including neuro-symbolic solvers, test-time search, and fine-tuning. Our results reveal that while rigid solvers flounder on real-world queries, RL fine-tuning (e.g., GRPO) with our unified reward significantly outperforms other methods built on the same base model, effectively bridging the gap between open-source models and proprietary baselines in authentic travel-planning scenarios.
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 945