TourMLLM: A Retrieval-Augmented Multimodal Large Language Model for Multitask Learning in the Tourism Domain

Published: 01 Jan 2025 · Last Modified: 02 Oct 2025 · ICMR 2025 · CC BY-SA 4.0
Abstract: Artificial Intelligence (AI) has shown significant potential in tourism, particularly in personalized recommendation, information provision, and user experience sharing. Recent advances in multimodal large language models (MLLMs) have further enhanced AI-driven solutions. Although some MLLMs for tourism have been proposed, existing models are constrained to a narrow range of tasks, which limits their effectiveness in providing information and personalization. Moreover, although several instruction tuning methods have been proposed, this domain still lacks effective learning methods that scale to multi-task learning. This paper proposes TourMLLM, a multimodal large language model designed to expand task coverage while improving training efficiency and accuracy. TourMLLM supports six key tasks, broadening its applications: landmark recognition, general review generation, conditional review generation, tourism recommendation, and tourism image captioning both with and without landmark names. To enhance adaptability and performance, we introduce task-adaptive retrieval-augmented instruction tuning and preference optimization strategies, allowing the model to handle diverse tourism-related tasks more effectively. Evaluation across the six tasks demonstrates that TourMLLM outperforms GPT-4o in accuracy. The dataset and code are available at https://github.com/HiromasaYamanishi/TourMLLM.
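The abstract's task-adaptive retrieval-augmented instruction tuning can be illustrated with a minimal sketch: for each training example, retrieve relevant reference texts from a per-task corpus and prepend them to the instruction before tuning. All function names, the token-overlap scorer, and the sample corpus below are illustrative assumptions, not the paper's actual pipeline.

```python
# Hypothetical sketch of task-adaptive retrieval-augmented prompt construction.
# The retriever, corpus, and prompt layout are assumptions for illustration only.

def retrieve(query_tokens, corpus, k=2):
    # Toy relevance score: token overlap between the query and each corpus entry.
    scored = sorted(
        corpus,
        key=lambda doc: len(query_tokens & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(instruction, task, knowledge_base, k=2):
    # Task-adaptive: each task draws context from its own retrieval corpus.
    corpus = knowledge_base.get(task, [])
    query_tokens = set(instruction.lower().split())
    context = retrieve(query_tokens, corpus, k)
    lines = ["Reference: " + c for c in context]
    lines.append("Instruction: " + instruction)
    return "\n".join(lines)

# Tiny illustrative knowledge base keyed by task name.
knowledge_base = {
    "landmark_recognition": [
        "Kinkaku-ji is a Zen temple in Kyoto covered in gold leaf.",
        "The Eiffel Tower is an iron lattice tower in Paris.",
    ],
}

prompt = build_prompt(
    "Identify the landmark: a gold temple in Kyoto.",
    "landmark_recognition",
    knowledge_base,
    k=1,
)
print(prompt)
```

In a real system the token-overlap scorer would be replaced by a dense or multimodal retriever, and the resulting prompts would feed the instruction-tuning stage.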