Keywords: scientific metrics, LLMs, self-verification, ImageNet
Abstract: We introduce a large-scale dataset of papers annotated with their reported Top-1 accuracy on the ImageNet test set, and we compare existing and new automatic metric-extraction methods, accompanied by a detailed qualitative error analysis. Our study highlights common reporting challenges that drive extraction errors, such as ambiguous dataset references, metrics reported only in tables, and missing Top-1 values. The released dataset comprises 200 manually annotated ImageNet classification papers, larger than prior work, and we evaluate our pipeline against both existing approaches and ablated baselines.
Submission Number: 140