AlignRec: Aligning and Training in Multimodal Recommendations
Abstract: With the development of multimedia systems, multimodal recommendations are playing an essential role, as they can leverage rich
contexts beyond interactions. Existing methods mainly regard multimodal information as auxiliary, using it to help learn ID features; however, semantic gaps exist between multimodal content features and ID-based features, so directly using multimodal information as an auxiliary leads to misaligned representations of users and items. In this paper, we first systematically investigate this misalignment issue in multimodal recommendations and propose a solution named AlignRec. In AlignRec, the
recommendation objective is decomposed into three alignments,
namely alignment within contents, alignment between content
and categorical ID, and alignment between users and items. Each
alignment is characterized by a specific objective function and is
integrated into our multimodal recommendation framework. To
effectively train AlignRec, we propose to first pre-train the first alignment to obtain unified multimodal features, and then train the remaining two alignments jointly with these features as input. Since it is essential to analyze whether each multimodal feature helps training and to accelerate the iteration cycle of recommendation models, we design three new classes
of metrics to evaluate intermediate performance. Our extensive
experiments on three real-world datasets consistently verify the
superiority of AlignRec compared to nine baselines. We also find that the multimodal features generated by AlignRec are better than those currently in common use; these features will be open-sourced in our repository at https://github.com/sjtulyf123/AlignRec_CIKM24.
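To make the decomposition concrete, the abstract describes the training objective as a combination of three alignment objectives. Below is a minimal sketch of how such a combined loss could look in code; the function name, cosine-based alignment terms, and weights are hypothetical placeholders and are not taken from the paper.

```python
import torch
import torch.nn.functional as F

def three_alignment_loss(content_emb_a, content_emb_b,
                         content_emb, id_emb,
                         user_emb, item_emb,
                         w_cc=1.0, w_cid=1.0, w_ui=1.0):
    """Hypothetical combination of three alignment objectives:
    (1) alignment within contents (e.g., image vs. text features of the same item),
    (2) alignment between content and categorical ID features,
    (3) alignment between user and item representations.
    Cosine similarity is used here only as a placeholder; the concrete
    objective functions in AlignRec may differ.
    """
    loss_cc = 1 - F.cosine_similarity(content_emb_a, content_emb_b, dim=-1).mean()
    loss_cid = 1 - F.cosine_similarity(content_emb, id_emb, dim=-1).mean()
    loss_ui = 1 - F.cosine_similarity(user_emb, item_emb, dim=-1).mean()
    return w_cc * loss_cc + w_cid * loss_cid + w_ui * loss_ui
```

Under the two-stage scheme described above, the first term would be optimized during pre-training to produce unified multimodal features, while the latter two would be trained together afterward with those features as input.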