Assessing Large Language Models (LLMs) as Foundational Recommenders: A Multi-Domain, Multi-Dataset Benchmark

18 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: benchmark, foundation models, recommender system, multi-domain
TL;DR: We collect 15 datasets for evaluating and fine-tuning foundational recommenders, and investigate eight multi-domain evaluation settings.
Abstract: Recent advances in large language models (LLMs) have raised the question of whether these models can serve as foundational recommenders across diverse domains. To systematically investigate this potential, we introduce RecBench-MD, a comprehensive benchmark for assessing the recommendation capabilities of LLMs from a multi-domain and multi-dataset perspective. Our benchmark encompasses 15 datasets spanning 10 domains, including e-commerce, entertainment, and social media, and evaluates 21 state-of-the-art LLMs under zero-resource, fine-tuning, and transfer-learning scenarios. Through extensive experiments, we reveal that (i) in-domain fine-tuning consistently delivers the strongest performance, (ii) cross-dataset transfer provides practical benefits in emerging recommendation contexts, and (iii) multi-domain training enhances the adaptability of LLM-based recommenders. These findings highlight both the opportunities and limitations of positioning LLMs as foundational recommenders. To support future research, we will publicly release all code and data.
Primary Area: foundation or frontier models, including LLMs
Submission Number: 10793