Can LLMs Outshine Conventional Recommenders? A Comparative Evaluation

Published: 18 Sept 2025, Last Modified: 30 Oct 2025
NeurIPS 2025 Datasets and Benchmarks Track poster
Readers: Everyone
License: CC BY 4.0
Keywords: benchmark, recommender system, large language models
TL;DR: The LLM-as-RS paradigm is not always a win: we benchmark it against traditional recommendation models.
Abstract: Integrating large language models (LLMs) into recommender systems has created new opportunities for improving recommendation quality. However, a comprehensive benchmark is needed to thoroughly evaluate and compare the recommendation capabilities of LLMs against those of traditional recommender systems. In this paper, we introduce RecBench, which systematically investigates various item representation forms (unique identifiers, text, semantic embeddings, and semantic identifiers) and evaluates two primary recommendation tasks: click-through rate (CTR) prediction and sequential recommendation (SeqRec). Our extensive experiments cover up to 17 large models and five diverse datasets spanning the fashion, news, video, book, and music domains. Our findings indicate that LLM-based recommenders outperform conventional recommenders, achieving up to a 5% AUC improvement on CTR and up to a 170% NDCG@10 improvement on SeqRec. However, these substantial gains come at the cost of significantly reduced inference efficiency, rendering LLMs impractical as real-time recommenders. We have released our code and data so that other researchers can reproduce and build upon our results.
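For readers unfamiliar with the two metrics quoted in the abstract, below is a minimal, self-contained sketch of how AUC for CTR prediction and NDCG@10 for sequential recommendation are conventionally computed. This is illustrative only, not code from RecBench; the labels, scores, and item names are hypothetical.

```python
# Sketch of the two evaluation metrics named in the abstract.
# All data here is toy/hypothetical; RecBench's own evaluation code
# lives in the repository linked below.
import numpy as np
from sklearn.metrics import roc_auc_score

# --- CTR: AUC over binary click labels and predicted click probabilities ---
labels = np.array([1, 0, 0, 1, 0])            # 1 = clicked, 0 = not clicked
scores = np.array([0.9, 0.3, 0.4, 0.7, 0.2])  # model's predicted probabilities
print("AUC:", roc_auc_score(labels, scores))

# --- SeqRec: NDCG@10 with a single relevant (ground-truth next) item ---
def ndcg_at_k(ranked_items, target, k=10):
    """DCG at the target's rank, normalized by the ideal DCG (target at rank 1,
    which gives 1/log2(2) = 1, so NDCG reduces to 1/log2(rank + 1)."""
    if target in ranked_items[:k]:
        rank = ranked_items.index(target) + 1  # 1-based rank
        return 1.0 / np.log2(rank + 1)
    return 0.0

ranked = ["item42", "item7", "item13"] + [f"item{i}" for i in range(7)]
print("NDCG@10:", ndcg_at_k(ranked, "item7"))  # rank 2 -> 1/log2(3) ≈ 0.63
```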
Croissant File: json
Dataset URL: https://www.kaggle.com/datasets/qijiong/recbench
Code URL: https://github.com/Jyonn/RecBench
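One convenient way to fetch the dataset programmatically is via the kagglehub package; this is a generic Kaggle download sketch matching the dataset URL above, not part of the RecBench codebase, and it makes no assumptions about the file layout inside the dataset.

```python
# Sketch: download the RecBench dataset from Kaggle and list its files.
# Assumes kagglehub is installed (pip install kagglehub) and Kaggle
# credentials are configured; the handle "qijiong/recbench" is taken
# from the dataset URL above.
import os
import kagglehub

path = kagglehub.dataset_download("qijiong/recbench")
print("Downloaded to:", path)
for name in sorted(os.listdir(path)):
    print(name)
```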
Primary Area: Evaluation (e.g., data collection methodology, data processing methodology, data analysis methodology, meta studies on data sources, extracting signals from data, replicability of data collection and data analysis and validity of metrics, validity of data collection experiments, human-in-the-loop for data collection, human-in-the-loop for data evaluation)
Submission Number: 654