SciFig: A Scientific Figure Dataset for Figure Understanding

Anonymous

05 Jun 2022 (modified: 05 May 2023)
Venue: ACL ARR 2022 June Blind Submission
Readers: Everyone
Keywords: figure extraction, scientific figure understanding, scholarly big dataset
Abstract: Most existing large-scale academic search engines are built to retrieve text-based information, and there are no large-scale retrieval services for non-textual components such as scientific figures and tables. One challenge in building such services is scientific figure understanding, which represents the visual information in a figure as text. A key obstacle is the lack of datasets of annotated scientific figures and tables that can be used for classification, question answering, and automatic captioning. Here, we design a pipeline that extracts figures and tables from scientific literature, together with a deep-learning-based framework that classifies scientific figures using visual features. Using this pipeline, we develop the first large-scale annotated corpus, SciFig, consisting of more than 264k scientific figures extracted from $\approx 56$k research papers in the ACL Anthology. We also release the SciFig-Pilot dataset, which contains 1671 manually labeled scientific figures belonging to 19 different categories. The dataset is publicly accessible at \url{https://bit.ly/3m4u0eq}.
Paper Type: short
Editor Reassignment: yes
Reviewer Reassignment: yes