Abstract: Classifiers commonly make use of pre-annotated datasets, wherein a model is evaluated by pre-defined metrics on a held-out test
set typically made of human-annotated labels.
Metrics used in these evaluations are tied to
the availability of well-defined ground truth labels, and these metrics typically do not allow
for inexact matches. Noisy ground truth labels and strict evaluation metrics may therefore compromise the validity and realism of evaluation
results. In the present work, we conduct a
systematic label verification experiment on the
entity linking (EL) task. Specifically, we ask
annotators to verify the correctness of annotations after the fact (i.e., posthoc). Compared to
pre-annotation evaluation, state-of-the-art EL
models performed extremely well according to
the posthoc evaluation methodology. Surprisingly, we find that predictions from EL models had a verification rate similar to or higher than that of the ground truth. We conclude with a discussion
of these findings and recommendations for future evaluations. The source code, raw results, and evaluation scripts are publicly available under the MIT license at https://github.com/yifding/e2e_EL_evaluate.