Rating consistency is consistently underrated

Denis Kotkov, Alan Medlar, Umesh Raj Satyal, Alexandr Maslov, Mats Neovius, Dorota Glowacka

Published: 25 Apr 2022, Last Modified: 09 Jan 2026
Proceedings of the 37th ACM/SIGAPP Symposium on Applied Computing (SAC 2022) · CC BY-SA 4.0
Abstract: Content-based and hybrid recommender systems rely on item-tag ratings to make recommendations. An example of an item-tag rating is the degree to which the tag "comedy" applies to the movie "Back to the Future (1985)". Ratings are often generated by human annotators who can be inconsistent with one another. However, many recommender systems take item-tag ratings at face value, assuming them all to be equally valid. In this paper, we investigate the inconsistency of item-tag ratings together with contextual factors that could affect consistency in the movie domain. We conducted semi-structured interviews to identify potential reasons for rating inconsistency. Next, we used these reasons to design a survey, which we ran on Amazon Mechanical Turk. We collected 6,070 ratings from 665 annotators across 142 movies and 80 tags. Our analysis shows that ∼45% of ratings are inconsistent with the mode rating for a given movie-tag pair. We found that the single most important factor for rating inconsistency is the annotator's perceived ease of rating, suggesting that annotators are at least tacitly aware of the quality of their own ratings. We also found that subjective tags (e.g. "funny", "boring") are more inconsistent than objective tags (e.g. "robots", "aliens"), and are associated with lower tag familiarity and lower perceived ease of rating.
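To make the inconsistency measure concrete, the following is a minimal Python sketch of the computation the abstract describes: the fraction of ratings that differ from the mode rating of their movie-tag pair. The sample data, rating scale, and tie-breaking behaviour are assumptions for illustration only, not the paper's actual dataset or protocol.

```python
from collections import Counter

# Illustrative (movie, tag, rating) triples; the 1-5 scale and these
# values are invented for the example, not taken from the paper's data.
ratings = [
    ("Back to the Future (1985)", "comedy", 4),
    ("Back to the Future (1985)", "comedy", 4),
    ("Back to the Future (1985)", "comedy", 2),
    ("Back to the Future (1985)", "robots", 1),
    ("Back to the Future (1985)", "robots", 1),
]

def inconsistency_rate(ratings):
    """Fraction of ratings that disagree with the mode rating of
    their (movie, tag) pair."""
    by_pair = {}
    for movie, tag, rating in ratings:
        by_pair.setdefault((movie, tag), []).append(rating)
    inconsistent = total = 0
    for pair_ratings in by_pair.values():
        # Counter.most_common(1) picks an arbitrary mode on ties;
        # the paper may handle ties differently.
        mode, _ = Counter(pair_ratings).most_common(1)[0]
        inconsistent += sum(r != mode for r in pair_ratings)
        total += len(pair_ratings)
    return inconsistent / total

print(f"{inconsistency_rate(ratings):.0%} of ratings disagree with the mode")
# -> 20% for this toy sample; the paper reports ~45% on its MTurk data.
```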