A New Dataset for Fine-Grained Citation Field Extraction

Sam Anzaroot; Andrew McCallum

25 Oct 2025 (modified: 10 May 2013), ICML 2013 PeerReview, Readers: Everyone

Decision: oral

Abstract: Citation field extraction entails segmenting a citation string into its constituent parts, such as title, authors, publisher and year. Despite the importance of this task, well-annotated citation data is scarce. This paper presents a new labeled dataset for citation extraction that, in comparison to the previous standard dataset, contains more than four times as much data, supplies detailed nested labels rather than coarse-grained flat labels, and is derived from four different academic fields rather than one. We describe the new dataset in detail and provide baseline experimental results from a state-of-the-art extraction method.
