A New Dataset for Fine-Grained Citation Field Extraction

Sam Anzaroot, Andrew McCallum

May 10, 2013 (modified: May 10, 2013) ICML 2013 PeerReview submission readers: everyone
  • Decision: oral
  • Abstract: Citation field extraction entails segmenting a citation string into its constituent parts, such as title, authors, publisher and year. Despite the importance of this task, there is a lack of well-annotated citation data. This paper presents a new labeled dataset for citation extraction that, in comparison to the previous standard dataset, contains more than four times as much data, supplies detailed nested labels rather than coarse-grained flat labels, and is derived from four different academic fields rather than one. We describe our new dataset in detail, and provide baseline experimental results from a state-of-the-art extraction method.