Abstract: Multi-modal learning has become a transformative approach in recommendation systems, leveraging diverse data types, such as visual, textual, and audio signals, to construct comprehensive user preference profiles. Despite significant progress, existing methods often struggle with key challenges, including imbalanced data utilization, diverse inter-modal correlations, and task-specific variability, which limit their ability to fully exploit inter-modal relationships. To address these issues, we propose KANM$^{2}$L (KAN Enhanced Multi-modal Learning for Recommendation), a novel framework that integrates the strengths of the Kolmogorov-Arnold Network (KAN) with multi-modal learning. Specifically, KANM$^{2}$L (1) introduces a KAN-enhanced dilated attention mechanism to capture high-dimensional, complex visual dependencies, enabling scalable and efficient processing of intricate datasets; (2) employs a multi-modal adversarial network to align and fuse features across modalities, improving cross-modal consistency and recommendation accuracy; and (3) incorporates a rotational loss function that leverages historical interaction data to stabilize and refine visual feature embeddings for more consistent performance. Extensive experiments on real-world datasets demonstrate that KANM$^{2}$L achieves state-of-the-art performance, with improvements of up to 11.7% over existing methods. These findings underscore the potential of KANM$^{2}$L to advance multi-modal recommendation systems by overcoming these limitations and delivering robust, scalable performance across diverse recommendation tasks.
DOI: 10.1109/TMM.2025.3623555
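The abstract's core building block is the Kolmogorov-Arnold Network, in which each input-output edge carries a learnable univariate activation rather than a fixed nonlinearity. For orientation, below is a minimal, self-contained PyTorch sketch of such a layer. It is an illustrative assumption, not the paper's implementation: the edge functions are parameterized here with Gaussian radial basis functions as a simple stand-in for the B-spline edges used in the original KAN formulation, and all names (`KANLayer`, `num_basis`) are hypothetical.

```python
import torch
import torch.nn as nn


class KANLayer(nn.Module):
    """Minimal Kolmogorov-Arnold layer sketch: every input-output edge has a
    learnable univariate function, parameterized here by Gaussian radial basis
    functions (an assumed simplification of the B-spline edges in KAN)."""

    def __init__(self, in_dim: int, out_dim: int, num_basis: int = 8):
        super().__init__()
        # Fixed RBF centers spread over an assumed input range of [-1, 1].
        self.register_buffer("centers", torch.linspace(-1.0, 1.0, num_basis))
        self.log_width = nn.Parameter(torch.zeros(1))
        # One coefficient per (output, input, basis) triple: the learnable
        # shape of each edge's univariate activation.
        self.coeffs = nn.Parameter(torch.randn(out_dim, in_dim, num_basis) * 0.1)
        # Residual linear path alongside the learned edge functions.
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, in_dim) -> phi: (batch, in_dim, num_basis)
        width = self.log_width.exp()
        phi = torch.exp(-((x.unsqueeze(-1) - self.centers) / width) ** 2)
        # Evaluate each edge function phi_{o,i}(x_i) and sum over inputs i.
        edge_out = torch.einsum("bik,oik->bo", phi, self.coeffs)
        return edge_out + self.linear(x)


if __name__ == "__main__":
    layer = KANLayer(in_dim=16, out_dim=4)
    x = torch.randn(32, 16)
    print(layer(x).shape)  # torch.Size([32, 4])
```

In a framework like the one the abstract describes, a layer of this kind could replace the fixed feed-forward transform inside an attention block, which is presumably what "KAN-enhanced dilated attention" refers to; the specifics of the dilation and fusion stages are left to the paper.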