Abstract: Multi-modal recommendation aims to leverage data from auxiliary modalities (e.g., textual descriptions and images) to enrich item representations and thereby accurately recommend items that users prefer from the vast volume of Web data. Existing multi-modal recommendation methods typically use multi-modal features directly to assist in learning item representations. However, they ignore the superfluous semantics contained in multi-modal features, which introduces excessive redundancy into item representations. Moreover, we reveal that the multi-modal features of items rarely carry user-item interaction information. Consequently, when different item features interact, the user-item interaction information encoded in ID-based representations is diluted, degrading recommendation performance. To address these issues, we propose a novel multi-modal recommendation approach that compresses the representations of auxiliary modalities, guided by a solid theoretical analysis, and leverages two auxiliary multi-modal graphs to inject user-item interaction information into multi-modal features. Experiments on three multi-modal recommendation datasets demonstrate that our method outperforms benchmark methods.
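The abstract only outlines the two core ideas, compressing auxiliary-modality representations and propagating them over interaction-aware graphs, without implementation detail. The sketch below is a minimal illustration of that general recipe, not the authors' implementation: the class name `ModalityCompressionFusion`, the linear compression layer, the single propagation step, and all parameter names are assumptions made purely for illustration.

```python
# Illustrative sketch (assumed design, not the paper's method): compress raw modality
# features into a compact space, propagate them over a normalized user-item graph so
# they absorb interaction information, then fuse them with ID-based item embeddings.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ModalityCompressionFusion(nn.Module):
    def __init__(self, n_users, n_items, modal_dim, id_dim=64, compressed_dim=64):
        super().__init__()
        self.user_id_emb = nn.Embedding(n_users, id_dim)
        self.item_id_emb = nn.Embedding(n_items, id_dim)
        # Compression: a learned projection that maps high-dimensional modality
        # features (e.g., text/image encoder outputs) to a compact representation,
        # discarding semantics irrelevant to recommendation.
        self.compress = nn.Linear(modal_dim, compressed_dim)
        # Fusion: combine ID-based and interaction-aware modality representations.
        self.fuse = nn.Linear(id_dim + compressed_dim, id_dim)

    def forward(self, modal_feats, norm_adj):
        """modal_feats: (n_items, modal_dim) raw modality features of items.
        norm_adj: sparse (n_users + n_items, n_users + n_items) normalized
        user-item interaction graph, standing in for an auxiliary modality graph."""
        compressed = F.relu(self.compress(modal_feats))  # (n_items, compressed_dim)
        # Pad user rows with zeros so compressed item features can be propagated
        # over the joint user-item graph.
        user_pad = torch.zeros(
            self.user_id_emb.num_embeddings, compressed.size(1),
            device=compressed.device,
        )
        graph_in = torch.cat([user_pad, compressed], dim=0)
        # One propagation step injects collaborative (interaction) signal into
        # the compressed modality features.
        graph_out = torch.sparse.mm(norm_adj, graph_in)
        item_modal = graph_out[self.user_id_emb.num_embeddings:]
        item_repr = self.fuse(torch.cat([self.item_id_emb.weight, item_modal], dim=1))
        return self.user_id_emb.weight, item_repr
```

Under these assumptions, the propagation step is what keeps modality features from diluting the interaction signal in the ID embeddings: by the time the two are fused, both carry collaborative information.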