Dataset Regeneration for Cross Domain Recommendation

ICLR 2026 Conference Submission16305 Authors

19 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0

Keywords: Recommender System, Cross-domain recommendation, Dataset Regeneration

Abstract: Cross-domain recommendation (CDR) mitigates data sparsity and cold-start problems by transferring knowledge from a source domain to a target domain. Despite recent progress, two key issues remain. (i) Sparse overlap: in real-world datasets such as Amazon, the proportion of users active in both domains is extremely low, which limits the effectiveness of many state-of-the-art CDR approaches. (ii) Negative transfer: existing methods primarily address this problem at the model level, often assuming that logged interactions are unbiased and noise-free. In practice, however, recommender data contain numerous spurious correlations, and domain heterogeneity exacerbates the issue in CDR. To address these challenges, we propose a dataset regeneration framework. First, we use a prediction model to generate a pool of high-confidence candidate interactions that link non-overlapping target-domain users to source-domain items. Second, inspired by causal inference, we introduce a filtering process that prunes spurious interactions. It removes noisy edges created during generation as well as noisy edges in the original dataset, retaining only the interactions that have a positive causal effect on target-domain performance. Together, these two processes regenerate a source-domain dataset that is more tightly coupled to the target domain and has a more explicit causal connection with it. Integrated with three representative recommendation backbones (LightGCN, BiTGCF, and CUT), our method significantly improves their predictive accuracy on the target domain, with gains of up to 23.81% in Recall@10 and 22.22% in NDCG@10.
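
The two-stage pipeline described in the abstract (generate candidates, then filter) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the score threshold, and the per-edge `effect` estimator are all hypothetical stand-ins, and the abstract does not specify how confidence or causal effect is actually computed. The sketch only assumes that some prediction model yields a score per (user, source item) pair and that some estimator yields a signed effect on target-domain performance per edge.

```python
from typing import Callable, Dict, Set, Tuple

Edge = Tuple[str, str]  # (user_id, source_item_id)


def generate_candidates(
    scores: Dict[Edge, float],
    non_overlap_users: Set[str],
    threshold: float,
) -> Set[Edge]:
    """Stage 1 (assumed form): keep predicted edges for non-overlapping
    target-domain users whose score reaches the confidence threshold."""
    return {
        edge
        for edge, score in scores.items()
        if edge[0] in non_overlap_users and score >= threshold
    }


def filter_edges(edges: Set[Edge], effect: Callable[[Edge], float]) -> Set[Edge]:
    """Stage 2 (assumed form): retain only edges whose estimated effect
    on target-domain performance is strictly positive."""
    return {edge for edge in edges if effect(edge) > 0.0}


def regenerate(
    original: Set[Edge],
    scores: Dict[Edge, float],
    non_overlap_users: Set[str],
    threshold: float,
    effect: Callable[[Edge], float],
) -> Set[Edge]:
    """Filter original and generated edges together, so that noisy edges
    from either origin can be pruned."""
    candidates = generate_candidates(scores, non_overlap_users, threshold)
    return filter_edges(original | candidates, effect)


# Toy data, invented purely for illustration.
original = {("u1", "i1"), ("u1", "i2")}
scores = {("u2", "i1"): 0.9, ("u2", "i3"): 0.4, ("u1", "i3"): 0.95}
effects = {("u1", "i1"): 0.2, ("u1", "i2"): -0.1, ("u2", "i1"): 0.05}

regenerated = regenerate(
    original,
    scores,
    non_overlap_users={"u2"},
    threshold=0.8,
    effect=lambda edge: effects.get(edge, 0.0),
)
```

In this toy run, the only generated edge is `("u2", "i1")`, and the original edge `("u1", "i2")` is pruned because its assumed effect is negative. Filtering the union rather than only the generated pool mirrors the abstract's statement that spurious edges are removed from the original dataset as well.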

Supplementary Material: zip

Primary Area: applications to computer vision, audio, language, and other modalities

Submission Number: 16305
