Abstract: Classifiers commonly make use of pre-annotated datasets, wherein a model is evaluated by pre-defined metrics on a held-out test
set typically made of human-annotated labels.
Metrics used in these evaluations are tied to
the availability of well-defined ground truth labels, and these metrics typically do not allow
for inexact matches. Noisy ground truth labels and strict evaluation metrics may therefore compromise the validity and realism of evaluation
results. In the present work, we conduct a
systematic label verification experiment on the
entity linking (EL) task. Specifically, we ask
annotators to verify the correctness of annotations after the fact (i.e., posthoc). Compared to
pre-annotation evaluation, state-of-the-art EL
models performed extremely well according to
the posthoc evaluation methodology. Surprisingly, we find that predictions from EL models had a verification rate similar to or higher than that of the ground truth. We conclude with a discussion
of these findings and recommendations for future evaluations. The source code, raw results, and evaluation scripts are publicly available under the MIT license at https://github.com/yifding/e2e_EL_evaluate.