Mind the Gap to Trustworthy LLM Agents: A Systematic Evaluation on Constraint Satisfaction for Real-World Travel Planning


AAAI 2026 Workshop TrustAgent Submission 83 Authors

Published: 20 Nov 2025, Last Modified: 09 Mar 2026. AAAI 2026 TrustAgent Workshop, Oral. License: CC BY 4.0

Keywords: TrustAgent; LLM-Agent; Travel Planning; Neuro-Symbolic AI

Abstract: Large language model (LLM) agents are increasingly claimed to handle complex, multi-step tasks, yet their trustworthiness in real-world tasks remains under-examined. Recent work on travel planning has already pointed out that constraint satisfaction is a persistent bottleneck, especially when itineraries must respect spatio-temporal feasibility, user-specific preferences, and budget or resource limits. However, these observations are mostly made in isolation: they are tied to a single dataset or a particular agent design, which makes it hard to tell whether the weakness is fundamental to current LLM agents or an artifact of the setup. This paper presents a systematic examination of constraint satisfaction in travel planning. We review existing travel-planning benchmarks, summarizing their design trends and highlighting the new challenges arising from these developments. We also categorize prevailing approaches into general-purpose agents, multi-agent systems, and neuro-symbolic approaches, and analyze their respective trade-offs between generalizability and domain adaptability. We introduce modular ability analyses to compare model performance across these approaches, enabling a deeper investigation into the diverse capabilities required for successful travel planning and revealing the limitations of current methods. We find that significant challenges remain in recognizing open constraints, extracting information under constraints, and reasoning under constraints. Although these problems are difficult to tackle as a whole, decomposing them into manageable sub-tasks offers a promising path toward trustworthy agents.

Submission Number: 83
