Vision and natural language for metadata extraction from scientific PDF documents: a multimodal approach

Published: 01 Jan 2022 · Last Modified: 19 Feb 2025 · JCDL 2022 · License: CC BY-SA 4.0
Abstract: The difficulty of automatically extracting metadata from scientific PDF documents depends on the diversity of layouts within the PDF collection. In some disciplines, such as the German social sciences, authors are not required to typeset their papers according to a specific template and often create their own, which yields high visual diversity across publications. Natural Language Processing (NLP) approaches alone are not always effective at handling this diversity, a limitation reflected in the lack of metadata for a large portion of German social science publications. We therefore propose a multimodal neural network model that combines NLP with Computer Vision (CV) for metadata extraction from scientific PDF documents, with the aim of exploiting both modalities to increase overall extraction accuracy. Extensive experiments on around 8,800 documents demonstrate the effectiveness of the proposed model over unimodal baselines, with an overall F1 score of 92.3%.
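The abstract does not specify how the two modalities are combined. A common way to realize such a multimodal model is late fusion: a text embedding of the page content and a visual embedding of the rendered page are concatenated and passed to a shared classifier over metadata field types. The sketch below illustrates that idea only; the embedding dimensions, class labels, and randomly initialized weights are illustrative assumptions, not the authors' architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: a 768-d text embedding (e.g. from a language
# model) and a 512-d visual embedding (e.g. from a CNN over the rendered
# page image). Class labels are likewise illustrative.
TEXT_DIM, VIS_DIM, N_CLASSES = 768, 512, 5  # title, author, abstract, date, other

# Randomly initialized fusion classifier (stand-in for trained weights).
W = rng.normal(0.0, 0.02, size=(TEXT_DIM + VIS_DIM, N_CLASSES))
b = np.zeros(N_CLASSES)

def fuse_and_classify(text_emb: np.ndarray, vis_emb: np.ndarray) -> np.ndarray:
    """Concatenate the two modality embeddings (late fusion) and apply a
    linear layer followed by softmax to get per-class probabilities."""
    fused = np.concatenate([text_emb, vis_emb])          # (TEXT_DIM + VIS_DIM,)
    logits = fused @ W + b                               # (N_CLASSES,)
    exp = np.exp(logits - logits.max())                  # numerically stable softmax
    return exp / exp.sum()

# Dummy embeddings standing in for real encoder outputs.
probs = fuse_and_classify(rng.normal(size=TEXT_DIM), rng.normal(size=VIS_DIM))
print(probs.shape)  # one probability per metadata class
```

In a trained system, `W` and `b` would be learned jointly with (or on top of) the text and vision encoders, so that visual cues such as font size and position can disambiguate fields that look alike textually.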
