Sample Efficiency Matters: Training Multimodal Conversational Recommendation Systems in a Small Data Setting

Published: 01 Jan 2024 · Last Modified: 13 May 2025 · ACM Multimedia 2024 · CC BY-SA 4.0
Abstract: With the increasing prevalence of virtual assistants, multimodal conversational recommendation systems (multimodal CRS) become essential for boosting customer engagement, improving conversion rates, and enhancing user satisfaction. Yet conversational samples, the training data for such a system, are difficult to obtain in large quantities, particularly on new platforms. To effectively train multimodal CRS in a small data setting, we enhance data quality to compensate for the small data quantity by augmenting conversations with dialogue states. We then devise an effective dialogue state encoder to bridge the semantic gap between conversation and product representations for recommendation. To further reduce the cost of dialogue state annotation, a semi-supervised learning method is developed to effectively train the dialogue state encoder with a small set of labeled conversations. In addition, we design a correlation regularisation that leverages knowledge in the multimodal product database to help align the textual and visual modalities. Experiments on the MMD dataset demonstrate the effectiveness of our method. In particular, with only 5% of the MMD training set, our method (namely SeMANTIC) obtains better NDCG scores than those of baseline models trained on the full MMD training set.
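The abstract does not specify the exact form of the correlation regularisation; a minimal sketch, assuming it acts as a cosine-alignment penalty between paired textual and visual product embeddings (the function name, weight lambda_corr, and loss composition below are illustrative assumptions, not the paper's actual formulation), might look like the following Python/PyTorch snippet:

    import torch
    import torch.nn.functional as F

    def correlation_regulariser(text_emb: torch.Tensor, image_emb: torch.Tensor) -> torch.Tensor:
        # Hypothetical regulariser: encourages the textual and visual embeddings
        # of the same product to be highly correlated (cosine-similar), pulling
        # the two modalities of the product database into a shared space.
        # text_emb, image_emb: (batch, dim) embeddings of the same products.
        text_emb = F.normalize(text_emb, dim=-1)
        image_emb = F.normalize(image_emb, dim=-1)
        # 1 minus the mean cosine similarity of paired text/image embeddings
        return 1.0 - (text_emb * image_emb).sum(dim=-1).mean()

    if __name__ == "__main__":
        batch, dim = 8, 256
        text_emb = torch.randn(batch, dim)
        image_emb = torch.randn(batch, dim)
        lambda_corr = 0.1                      # assumed weighting hyperparameter
        rec_loss = torch.tensor(0.0)           # placeholder for the main recommendation loss
        loss = rec_loss + lambda_corr * correlation_regulariser(text_emb, image_emb)
        print(loss.item())

In this sketch the regulariser is simply added to the recommendation objective with a scalar weight; how the paper actually combines the terms, and which product-database knowledge it injects, would need to be taken from the full text.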