Keywords: scientific metrics, LLMs, self-verification, ImageNet
Abstract: We introduce a large-scale dataset of papers annotated with their reported Top-1 accuracy on the ImageNet test set, and we compare existing and new automatic metric-extraction methods, accompanied by a detailed qualitative error analysis. Our study highlights common reporting challenges that drive extraction errors, such as ambiguous dataset references, metrics reported only in tables, and missing Top-1 values. The released dataset comprises 200 manually annotated ImageNet classification papers, larger than prior work, and we evaluate our pipeline against both existing approaches and ablated baselines.
Submission Number: 140