Enhancing Product Representation with Multi-form Interactions for Multimodal Conversational Recommendation

Published: 01 Jan 2023 · Last Modified: 19 Feb 2025 · ACM Multimedia 2023 · CC BY-SA 4.0
Abstract: Multimodal Conversational Recommendation aims to find appropriate products based on a multi-turn dialogue, where both user requests and products can be presented in visual and textual modalities. While previous studies have focused on understanding user preferences from conversational contexts, product modeling has remained relatively unexplored. This study seeks to fill this gap and demonstrates that, alongside dialogue information, multiple product views and cross-view interactions are essential for recommendation. To this end, a product image is first encoded with a gated multi-view image encoder, yielding representations for a global view and a local view. On the textual side, two views are considered: the structure view (product attributes) and the sequence view (product description and reviews). Two forms of inter-modal interaction are then modeled for product representation: between the global image view and the textual structure view, and between the local image view and the textual sequence view. Finally, the representation is made to attend to the latest user request in the dialogue context, yielding a query-aware product representation. Experimental results indicate that our method, named Enteract, achieves state-of-the-art performance on two well-known datasets (MMD and SIMMC).
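
To make the pipeline described in the abstract concrete, below is a minimal PyTorch sketch of its four steps: gated fusion of the global and local image views, the two cross-modal interactions (global view with the structure view, local view with the sequence view), and query-aware attention over the latest user request. All module names, dimensions, and design choices (sigmoid gating, multi-head cross-attention, mean pooling) are illustrative assumptions; the abstract does not specify the paper's actual implementation.

```python
import torch
import torch.nn as nn

class ProductEncoderSketch(nn.Module):
    """Hypothetical sketch of the abstract's product-encoding pipeline.
    All names and hyperparameters are assumptions for illustration."""

    def __init__(self, dim=256, heads=4):
        super().__init__()
        # Gate controlling how much pooled local detail flows into the global view.
        self.gate = nn.Linear(2 * dim, dim)
        # Interaction 1: global image view attends to the structure view (attributes).
        self.global_struct_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Interaction 2: local image view attends to the sequence view (description/reviews).
        self.local_seq_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Query-aware attention conditioned on the latest user request.
        self.query_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, img_global, img_local, txt_struct, txt_seq, user_query):
        # img_global: (B, 1, D)  one vector for the whole product image
        # img_local:  (B, R, D)  R region features
        # txt_struct: (B, A, D)  A attribute embeddings
        # txt_seq:    (B, T, D)  T tokens of description/reviews
        # user_query: (B, Q, D)  tokens of the latest user request

        # Gated multi-view image encoding: blend pooled local detail into the global view.
        local_pooled = img_local.mean(dim=1, keepdim=True)
        g = torch.sigmoid(self.gate(torch.cat([img_global, local_pooled], dim=-1)))
        img_global = g * img_global + (1 - g) * local_pooled

        # Cross-view interactions: global <-> structure, local <-> sequence.
        gs, _ = self.global_struct_attn(img_global, txt_struct, txt_struct)
        ls, _ = self.local_seq_attn(img_local, txt_seq, txt_seq)

        # Concatenate interaction outputs into one product token sequence.
        product = torch.cat([gs, ls], dim=1)   # (B, 1 + R, D)

        # Query-aware product representation: attend to the latest user request.
        out, _ = self.query_attn(product, user_query, user_query)
        return out.mean(dim=1)                 # (B, D) final product embedding

# Usage with random tensors, just to check the shapes.
enc = ProductEncoderSketch()
B, D = 2, 256
rep = enc(torch.randn(B, 1, D), torch.randn(B, 9, D),
          torch.randn(B, 5, D), torch.randn(B, 32, D), torch.randn(B, 12, D))
print(rep.shape)  # torch.Size([2, 256])
```

The pairing of views here (global with attributes, local with free text) mirrors the abstract's two interaction forms; whether the paper uses attention or another interaction mechanism is not stated, so cross-attention is used as a plausible stand-in.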