Abstract: Personalized recommender systems play a crucial role in modern society, especially in the e-commerce, news, and advertising domains. Correctly evaluating and comparing candidate recommendation models is as essential as constructing them. The common offline evaluation strategy is to hold out some user-interacted items from the training data and evaluate recommendation models by how many of these items they can retrieve. Specifically, for any held-out (target) item of a user, a recommendation model predicts the probability that the user would interact with the item and ranks it among all items; this protocol is called global evaluation. Intuitively, a good recommendation model should assign high probabilities to such held-out/target items. Based on the resulting ranks, metrics such as Recall@K and NDCG@K can be computed to further quantify the quality of the recommender model. Instead of ranking the target items among all items, Koren first proposed ranking them among a small sampled set of items and then quantifying the performance of the models, which is called sampling evaluation. Ever since, a large body of work has adopted sampling evaluation due to its efficiency and frugality. In recent work, Rendle and Krichene argued that sampling evaluation is “inconsistent” with the corresponding global metrics in terms of offline top-K metrics.
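The following is a minimal sketch (not from the paper) of the two evaluation protocols described above for a single (user, target item) pair. The array `scores`, the index `target`, and the parameter `num_negatives` are hypothetical names introduced only for illustration; `scores` is assumed to hold the model's predicted scores over the full item catalog.

```python
import numpy as np

def rank_global(scores: np.ndarray, target: int) -> int:
    """Global evaluation: rank of the target item among ALL items (1 = best)."""
    return int(np.sum(scores > scores[target])) + 1

def rank_sampled(scores: np.ndarray, target: int, num_negatives: int = 100, seed: int = 0) -> int:
    """Sampling evaluation: rank of the target item among itself plus
    `num_negatives` randomly sampled non-target items."""
    rng = np.random.default_rng(seed)
    candidates = np.delete(np.arange(len(scores)), target)
    negatives = rng.choice(candidates, size=num_negatives, replace=False)
    return int(np.sum(scores[negatives] > scores[target])) + 1

def recall_at_k(rank: int, k: int) -> float:
    """Recall@K for one target item: 1 if it appears in the top K, else 0."""
    return 1.0 if rank <= k else 0.0

def ndcg_at_k(rank: int, k: int) -> float:
    """NDCG@K for one target item: log-discounted gain if it appears in the top K."""
    return 1.0 / np.log2(rank + 1) if rank <= k else 0.0

# Example: the same model may miss the top K against the full catalog (global)
# yet land inside the top K against only a small sampled candidate set.
scores = np.random.default_rng(42).normal(size=10_000)
target = 7
print(recall_at_k(rank_global(scores, target), k=10),
      ndcg_at_k(rank_sampled(scores, target), k=10))
```

Averaging these per-item values over all held-out items yields the dataset-level Recall@K and NDCG@K; the discrepancy between the global and sampled ranks is the source of the inconsistency discussed in the abstract.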