Cross-Modal Factor Reasoning with LLMs: Toward Semantic-Structured Generalization for Recommendation
Keywords: Multimodal Recommendation, Large Language Model, Semantic Reasoning
Abstract: Multimodal recommendation aims to enhance personalization by leveraging content signals such as text and images. However, existing methods often treat modalities as shallow auxiliary inputs, fusing raw embeddings without reasoning about which semantics are useful or how they influence user preference. Content-based graphs typically rely on low-level similarity and lack structured semantic relations such as functionality or style. Moreover, collaborative signals are used solely for ranking, without grounding content semantics. To address these limitations, we present MARS, a framework for Cross-Modal FActor Reasoning with LLMs that enables Semantic-Structured Generalization in recommendation. MARS introduces a cognitively guided paradigm that prompts large language models (LLMs) to extract human-interpretable semantic factors (e.g., functionality, material, and usage scenario) from raw visual and textual descriptions. These structured factors are used to build heterogeneous graphs that capture multi-aspect semantic relations among items. To integrate semantics into representation learning, we propose an auxiliary semantic prediction task that aligns collaborative embeddings with LLM-inferred factor knowledge. In addition, a cross-modal consistency loss encourages agreement across semantic views from different modalities. Extensive experiments show that MARS achieves superior accuracy and generalization compared to state-of-the-art multimodal baselines and LLM-based methods.
Primary Area: foundation or frontier models, including LLMs
Submission Number: 23023