Abstract: This paper addresses the question of whether the impression log should be used in evaluating news recommendation models. We start with the claim that testing with the impression log, which is composed only of hard-negative news (i.e., the impression (IMP)-based test), does not evaluate the models precisely. Based on this claim, we discuss a way of evaluating models that employs all kinds of negative news articles (i.e., the Total test). We also propose a more efficient way of evaluating models that samples only a small number of negative articles (i.e., the random-sampling (RS)-based test). We verify our claim by extensively comparing the evaluation results of six models under the IMP-based, Total, and RS-based tests: the RS-based test determines the superiority among the models more accurately than the IMP-based test while being more efficient than the Total test. Therefore, our answer to the question above is "do not employ the impression log in testing models even if it is available." This result is meaningful because it allows news recommendation researchers and practitioners who have been misled by relying on the impression log to switch to a more reliable way of testing.
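To make the difference between the three tests concrete, the following is a minimal sketch of how the candidate set for a single test instance might be built under each protocol. It is an illustration under stated assumptions, not the paper's implementation: the function names `build_candidates` and `rank_of_positive`, the parameter `k`, and the choice to exclude previously clicked articles from the negative pool are all hypothetical.

```python
import random


def build_candidates(positive, impression_negatives, all_articles, clicked,
                     mode, k=20, seed=0):
    """Return [positive] + negatives for one test instance.

    mode="imp":   negatives are the non-clicked articles in the impression log.
    mode="total": negatives are all other articles in the corpus.
    mode="rs":    negatives are k articles sampled uniformly from that pool.
    Assumption: previously clicked articles are excluded from the pool.
    """
    if mode == "imp":
        negatives = list(impression_negatives)
    else:
        # Sort so that sampling is reproducible for a given seed.
        pool = sorted(a for a in all_articles
                      if a != positive and a not in clicked)
        if mode == "total":
            negatives = pool
        elif mode == "rs":
            rng = random.Random(seed)
            negatives = rng.sample(pool, min(k, len(pool)))
        else:
            raise ValueError(f"unknown mode: {mode}")
    return [positive] + negatives


def rank_of_positive(score, candidates):
    """1-based rank of the positive (first candidate) under a scoring function."""
    positive_score = score(candidates[0])
    return 1 + sum(1 for c in candidates[1:] if score(c) > positive_score)


# Toy corpus of ten articles; the user clicked n1 earlier and n3 in this session.
articles = [f"n{i}" for i in range(1, 11)]
imp = build_candidates("n3", ["n4", "n5"], articles, {"n1"}, mode="imp")
total = build_candidates("n3", ["n4", "n5"], articles, {"n1"}, mode="total")
rs = build_candidates("n3", ["n4", "n5"], articles, {"n1"}, mode="rs", k=3)
```

In this sketch the IMP-based candidate set depends on what was shown to the user, whereas the Total and RS-based sets depend only on the corpus; the RS-based set is simply a smaller sample of the Total pool, which is where its efficiency comes from.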