SCREEN: A Benchmark for Situated Conversational Recommendation

Published: 20 Jul 2024, Last Modified: 05 Aug 2024 · MM 2024 Poster · CC BY 4.0
Abstract: Engaging in conversational recommendation within a specific scenario is a promising paradigm in the real world. Scenario-relevant situations affect conversations and recommendations in two closely related ways: they vary how appealing items are to users, namely $\textit{situated item representation}$, and they shift user interest in the targeted items, namely $\textit{situated user preference}$. We highlight that accounting for these situational factors is crucial, as it mirrors how conversational recommendation unfolds in the physical world; however, it remains challenging and under-explored. In this work, we take a first step toward bridging this gap and introduce a novel setting: $\textit{Situated Conversational Recommendation Systems}$ (SCRS). Because there is an emerging need for high-quality datasets and building one from scratch requires tremendous human effort, we construct a new benchmark, named $\textbf{SCREEN}$, via a role-playing method based on multimodal large language models: two such models play the roles of a user and a recommender, simulating their interactions in a co-observed scene. SCREEN comprises over 20k dialogues across 1.5k diverse situations, providing a rich foundation for studying situational influences on conversational recommendation. Based on SCREEN, we propose three subtasks worth exploring and evaluate several representative baseline models. Our evaluations suggest that the benchmark is of high quality, establishing a solid experimental basis for future research. The code and data are available at https://github.com/DongdingLin/SCREEN.
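For readers curious about the construction procedure, the following is a minimal sketch of the role-playing simulation described in the abstract, assuming a generic multimodal chat API. The names `call_mllm`, `RolePlayAgent`, and the prompts are hypothetical illustrations for exposition, not the released SCREEN pipeline.

```python
# Sketch: two multimodal LLM agents -- a simulated user and a recommender --
# take turns conditioned on the same scene image, so the situation shapes
# both the user's preferences and how items are presented.
from dataclasses import dataclass, field


def call_mllm(system: str, image_path: str, messages: list) -> str:
    # Placeholder for any multimodal chat-completion API; swap in a real call.
    return "(model reply conditioned on the scene and dialogue history)"


@dataclass
class RolePlayAgent:
    system_prompt: str                      # role instructions + situation description
    history: list = field(default_factory=list)

    def respond(self, image_path: str, partner_utterance: str) -> str:
        # Each agent keeps its own view of the dialogue and sees the shared scene.
        self.history.append({"role": "user", "content": partner_utterance})
        reply = call_mllm(self.system_prompt, image_path, self.history)
        self.history.append({"role": "assistant", "content": reply})
        return reply


def simulate_dialogue(image_path: str, situation: str, target_item: str, max_turns: int = 8):
    """Simulate one situated recommendation dialogue over a co-observed scene."""
    seeker = RolePlayAgent(
        f"You are a user in this situation: {situation}. "
        "Chat naturally and reveal your preferences gradually.")
    recommender = RolePlayAgent(
        f"You are a recommender in this situation: {situation}. "
        f"Ground your suggestions in the scene and steer toward: {target_item}.")
    dialogue, utterance = [], "Hello!"
    for _ in range(max_turns):
        utterance = recommender.respond(image_path, utterance)
        dialogue.append(("recommender", utterance))
        utterance = seeker.respond(image_path, utterance)
        dialogue.append(("user", utterance))
    return dialogue
```

In such a setup, the situation description and the shared scene image enter both agents' prompts, which is what ties the simulated preferences and item presentations to the co-observed context.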
Primary Subject Area: [Engagement] Multimedia Search and Recommendation
Secondary Subject Area: [Content] Vision and Language, [Content] Multimodal Fusion
Relevance To Conference: This work contributes to multimedia/multimodal processing by introducing the paradigm of Situated Conversational Recommendation, which changes how recommendation systems process and interpret multimodal inputs across varied situational contexts. Specifically, it underscores the importance of situational factors, such as the environment or context in which a conversation occurs, in shifting user preferences and item representations. This perspective acknowledges that user interests and item relevance are not static but vary with context, which calls for a dynamic, situation-aware processing strategy. By constructing the SCREEN dataset through role-playing, this work provides a rich, diverse foundation of over 20,000 dialogues across 1,500 situations. The dataset supports research on how situational factors affect conversational recommendations and enables the development of models that interpret and respond to multimodal cues (textual, visual, and contextual) within conversations. Evaluations of several baseline models on SCREEN demonstrate its utility for building systems capable of sophisticated multimodal understanding.
Supplementary Material: zip
Submission Number: 5331