Keywords: travel planning, reasoning, reinforcement learning
Abstract: Travel planning is a valuable yet complex task that poses significant challenges even for advanced large language models (LLMs). Existing benchmarks, however, largely equate planning ability with solving rigid constraint-satisfaction problems, and solvers that excel at such synthetic logic puzzles often fail to handle the ambiguity of real-world user intents. To address this gap, we present TripScore, a behavior-grounded benchmark and evaluation framework designed to align agent development with real-world utility. We release a large-scale dataset of 4,870 queries, including 219 real-world, free-form requests that test generalization to authentic user intent. We also propose a unified evaluation reward that fuses feasibility and quality into a single fine-grained scalar. Our evaluator achieves moderate agreement with travel-expert annotations (60.75%) and outperforms multiple LLM-as-judge baselines. Leveraging TripScore, we conduct extensive experiments across diverse paradigms, including neuro-symbolic solvers, test-time search, and fine-tuning. Our results reveal that while rigid solvers flounder on real-world queries, RL fine-tuning (e.g., GRPO) with our unified reward significantly outperforms other methods built on the same base model, effectively bridging the gap between open-source models and proprietary baselines in authentic travel-planning scenarios.
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 945