Abstract: The effectiveness of a search engine is typically evaluated using hand-labeled datasets, where the labels indicate the relevance of documents to queries. Often the number of labels needed is too large for the best annotators to produce, so less expensive labels (e.g., from crowdsourcing) are used instead. This introduces errors in the labels, and thus errors in standard effectiveness metrics (such as P@k and DCG). These errors must be taken into consideration when using the metrics. Previous work has approached assessor error by taking aggregates over multiple inexpensive assessors. We take a different approach and introduce equations and algorithms that adjust the metrics to the values they would have had if there were no annotation errors. This is especially important when two search engines are compared on their metrics. We give examples where one engine appeared to be statistically significantly better than the other, but the effect disappeared after the metrics were corrected for annotation error. In other words, the evidence supporting a statistical difference was illusory and caused by a failure to account for annotation error.
External IDs: doi:10.1145/3186195