Sample Efficiency Matters: Training Multimodal Conversational Recommendation Systems in a Small Data Setting

Published: 20 Jul 2024, Last Modified: 21 Jul 2024, Venue: MM2024 Poster, License: CC BY 4.0
Abstract: With the increasing prevalence of virtual assistants, multimodal conversational recommendation systems (multimodal CRS) have become essential for boosting customer engagement, improving conversion rates, and enhancing user satisfaction. Yet conversational samples, the training data for such systems, are difficult to obtain in large quantities, particularly on new platforms. Motivated by this challenge, we aim to design methods for training multimodal CRS effectively even in a small data setting. Specifically, assuming the availability of a small number of samples with dialogue states, we devise a dialogue state encoder to bridge the semantic gap between conversation and product representations for recommendation. To reduce the cost of dialogue state annotation, we develop a semi-supervised learning method that trains the dialogue state encoder with only a small set of labeled conversations. In addition, we design a correlation regularisation that leverages knowledge in the multimodal product database to better align the textual and visual modalities. Experiments on the MMD dataset demonstrate the effectiveness of our method. In particular, with only 5% of the MMD training set, our method (namely SeMANTIC) obtains better NDCG scores than baseline models trained on the full MMD training set.
Primary Subject Area: [Engagement] Multimedia Search and Recommendation
Secondary Subject Area: [Experience] Multimedia Applications, [Content] Vision and Language, [Systems] Data Systems Management and Indexing
Relevance To Conference: Our research advances multimedia/multimodal processing by introducing SeMANTIC, a novel approach for training a multimodal conversational recommendation system in low-data environments. This method enriches dialogue and product representations by incorporating dialogue states and a regularization term that leverages the rich multimodal information in the multimodal product database. Additionally, to mitigate the expense of annotating dialogue states, we employ a teacher-student framework to learn dialogue state embeddings from conversations that lack explicit dialogue state annotations.
Supplementary Material: zip
Submission Number: 3053