TripTide: A Benchmark for Adaptive Travel Planning under Disruptions

ACL ARR 2026 January Submission 5057 Authors

05 Jan 2026 (modified: 20 Mar 2026), ACL ARR 2026 January Submission, CC BY 4.0
Keywords: Large Language Models, Disruption-based Travel Planning
Abstract: Recent work, such as TripCraft and TravelPlanner, has shown the promise of Large Language Models (LLMs) for personalised, constraint-aware travel itinerary generation, but real-world travel often involves disruptions. To address this gap, we introduce TripTide, the first benchmark for evaluating LLMs’ ability to revise itineraries under realistic disruptions. TripTide models disruption severity and traveler tolerance, enabling systematic evaluation of LLM responses to events such as transit cancellations, weather closures, and overbooked attractions. We conduct a three-fold evaluation: (i) automatic metrics measuring Preservation of Intent, Responsiveness, and Adaptability (semantic, spatial, and sequential), (ii) an LLM-as-a-Judge evaluation, and (iii) a human study assessing revision quality. Our findings show that LLMs largely preserve semantic and sequential structure, while spatial deviations are higher for shorter itineraries and diminish for longer ones. However, disruption-handling performance declines as itinerary length increases. TripTide provides a foundation for benchmarking robustness and adaptability in LLM-based travel planning.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, NLP datasets, evaluation, metrics
Contribution Types: Data resources
Languages Studied: English
Submission Number: 5057