Editorial: Mining Scientific Papers: NLP-enhanced Bibliometrics

Published: 01 Jan 2019, Last Modified: 19 Feb 2025. Frontiers Res. Metrics Anal. 2019. License: CC BY-SA 4.0
Abstract:

INTRODUCTION

The Research Topic on "NLP-enhanced Bibliometrics" aims to promote interdisciplinary research in bibliometrics, Natural Language Processing (NLP), and computational linguistics in order to enhance the ways bibliometrics can benefit from large-scale text analytics and sense mining of papers. The objectives of such research are to provide insights into scientific writing and to bring new perspectives to the understanding of both the nature of citations and the nature of scientific papers and their internal structures. The possibility of enriching metadata through the full-text processing of papers opens a new field of investigation, where the major problems arise around the organization and structure of text, the extraction of information, and its representation at the level of metadata.

Recently, the ever-growing availability of datasets and papers in full text and in machine-readable formats has made possible a change of perspective in the field of bibliometrics. From preprint databases to the Open Access and Open Science movements, the development of online platforms such as arXiv, CiteSeer, or PLoS has largely contributed to facilitating experimentation with datasets of articles, making it possible to perform bibliometric studies that consider not only the metadata of papers but also their full-text content.

The field of NLP offers methodological frameworks and tools for the full-text processing of papers that can inform bibliometric studies. Open-source tools for text processing that have recently been applied to such tasks include NLTK, Mallet, OpenNLP, CoreNLP, GATE, CiteSpace, AllenNLP, and others. Many datasets are now freely available to the community, e.g., PubMed OA, CiteSeerX, JSTOR, ISTEX, the Microsoft Academic Graph, and the ACL Anthology. Further developments in this field of study require annotated corpora and shared evaluation protocols in order to enable comparison between different tools and methods.
The development of such resources is an important step toward making scientific reproducibility possible.

PAPERS IN THIS RESEARCH TOPIC

The seven papers published in this Research Topic were all reviewed by two independent reviewers.

In the paper "Is the Abstract a Mere Teaser? Evaluating Generosity of Article Abstracts in the Environmental Sciences", Ermakova et al. (2018) examine the abstracts of scientific papers. The abstract highlights the information that is most important for the reader and is often used as a proxy for the content of an article. The authors propose the GEM score, which measures the representativeness, or "generosity", of an abstract. To obtain this score, sections of the papers are weighted according to their importance to the reader, and sentences in the abstracts are assigned to sections based on their similarity with the sections' content. More than 36,000 papers in the environmental sciences, retrieved from the ISTEX database, were processed to observe trends in the GEM score over an 80-year period. The results show that abstracts tend to be more generous in recent publications, and that there appears to be no correlation between the GEM score and the citation rate of a paper.

In the paper "The Termolator: Terminology Recognition Based on Chunking, Statistical and Search-Based Scores", Meyers et al. (2018) propose an open-source, high-performing terminology extraction system called the Termolator, which combines knowledge-based and statistical components. The Termolator includes a chunking component that favors chunks containing out-of-vocabulary words, nominalizations, technical adjectives, and other specialized word classes, and it supports term chunk ranking. The authors analyse the contribution of each component to the overall system's performance and compare the Termolator with another terminology extraction system, Termostat.
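As a rough illustration of the section-matching idea behind the GEM score of Ermakova et al. (2018) described above, the following toy sketch weights sections by importance and counts a section as "covered" when some abstract sentence is sufficiently similar to it. The token-overlap similarity, the 0.2 threshold, and the weights are illustrative assumptions, not the published GEM parameters.

```python
def jaccard(a, b):
    """Token-overlap similarity between two texts (a toy stand-in for the
    similarity measure used to assign abstract sentences to sections)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def generosity(abstract_sentences, sections, weights, threshold=0.2):
    """Toy 'generosity' score: a section counts as covered if some abstract
    sentence is sufficiently similar to its content; the score is the
    weighted fraction of covered sections."""
    covered = sum(
        weights[name]
        for name, text in sections.items()
        if any(jaccard(s, text) >= threshold for s in abstract_sentences)
    )
    return covered / sum(weights.values())
```

On this toy scale, an abstract whose sentences echo every weighted section scores 1.0 ("generous"), while an abstract unrelated to any section scores 0.0.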
They use a gold standard consisting of manually annotated instances of inline terms (multi-word nominal expressions) in different types of documents (e.g., patents and journal articles).

In the paper "Deep Reference Mining From Scholarly Literature in the Arts and Humanities", Rodrigues Alves et al. (2018) work on a deep learning architecture for the detection, extraction, and classification of references within the full text of scholarly publications. The authors explore word- and character-level word embeddings, different prediction layers (Softmax and Conditional Random Fields), and multi-task versus single-task learning components. Their experiments are based on a published dataset of annotated references from a corpus of publications on the historiography of Venice (books and journal articles in Italian, English, French, German, Spanish, and Latin) published from the 19th century to 2014. In their evaluation, the authors show the positive contribution of their character-level word embeddings. They release two implementations of the architecture, in Keras and TensorFlow, along with all the data needed to train and test the models. Their results strongly support the adoption of deep learning methods for the general task of reference mining.

In the paper "Temporal Representations of Citations for Understanding the Changing Roles of Scientific Publications", He and Chen (2018) propose an analysis of the temporal characteristics of citations in order to represent the dynamic role of scientific publications. For this purpose, they study and compare different types of citation contexts in order to identify articles that play an important role in the development of science.
The proposed methods have several potential applications, such as improving citation-based techniques at the individual or collective level, as well as improving recommendation systems for information retrieval by identifying articles of importance or interest.

In the paper "Resolving Citation Links With Neural Networks", Nomoto (2018) presents a novel way to tackle citation resolution through the application of neural network models, and identifies some of the operational factors that influence their behavior. The author introduces the notion of approximately correct targets, which is "an idea that we should treat sentences that occur in the vicinity of true targets as equally correct, whereby we try to identify an area which is likely to include a true target, instead of finding its exact location". Experiments in the paper are conducted using three datasets developed for the CL-SciSumm Shared Task (ACL repository) in a cross-validation setup.

The two papers "The NLP4NLP Corpus (I and II): 50 Years of Publication, Collaboration and Citation in Speech and Language Processing", Mariani et al. (2019a,b), present the results of an extensive study of a dataset in the fields of Natural Language Processing (NLP) and Spoken Language Processing (SLP) for the period 1956-2015. The authors investigate various trends that can be observed from the publications in this research domain. The study is presented in two companion papers, each providing a different perspective on the analysis. The first paper describes the corpus and presents an overall analysis of the number of papers, authors, gender distributions, co-authorship, collaboration patterns, and citation patterns. The second paper investigates the research topics and their evolution over time, the key innovative topics and the authors who introduced them, as well as the reuse of papers and plagiarism.
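The "approximately correct targets" idea quoted above for Nomoto (2018) — accepting predictions that land in the vicinity of the true target sentence — can be sketched as a relaxed evaluation criterion. The window size here is an illustrative assumption, not the value used in the paper.

```python
def approximately_correct(predicted_idx, true_idx, window=2):
    """A predicted target sentence counts as correct if it falls within
    `window` sentences of the true target, rather than requiring the
    exact location (window=2 is an illustrative choice)."""
    return abs(predicted_idx - true_idx) <= window

def relaxed_accuracy(predictions, gold, window=2):
    """Fraction of citation links whose predicted target sentence lands
    in the vicinity of the true target."""
    hits = sum(approximately_correct(p, g, window) for p, g in zip(predictions, gold))
    return hits / len(gold)
```

Under this relaxation, a system that misses the exact target sentence by one or two positions is still credited, which is the intuition behind treating vicinity sentences "as equally correct".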
Together, the two papers provide a survey of the literature in NLP and SLP, along with the data needed to understand the trends and evolution of research in this community. The study can also be seen as a methodological framework for producing similar surveys in other scientific areas. The authors report on the major obstacles encountered during such processing. The first is the errors introduced by the automatic processing of the full text of papers, in particular scanned content. The second is the lack of a consistent and uniform identification of authors, affiliations, conference titles, etc., all of which require manual corrections by experts in the research area under investigation.

CONCLUSION

The large number of studies on the use of scientific documents for bibliometric applications shows the growing interest of the bibliometric community in this subject. Since 2016, we have been maintaining the "Bibliometric-enhanced-IR Bibliography"1, a bibliography of all scientific articles (from workshops and journals) on this Research Topic. In 2018, two special issues closely related to this Research Topic were published. The first is the special issue on "Bibliometric-enhanced information retrieval and natural language processing for digital libraries (BIRNDL)" in the International Journal on Digital Libraries (Mayr et al., 2018). The second is "Bibliometric-enhanced Information Retrieval and Scientometrics" in Scientometrics (Cabanac et al., 2018).

The articles published in this Research Topic contribute to the state of the art through theoretical discoveries, practical methods, and technologies for the processing of scientific corpora, involving full-text processing, the classification of citations and their temporal representation, semantic analysis, text mining, and related topics.
Taken together, these papers identify some of the new challenges in this area and pave the way for future theoretical frameworks.

Deep learning techniques are emerging in this field, with approaches based on neural network models that can play a fundamental role in the exploitation of citations and their contexts in the scientific literature. While the development of neural network models requires large resources, the increasing number of datasets available today makes it possible to apply this type of technology to the analysis of citations. Indeed, two of the articles in this Research Topic deal with the application of neural network models to citation analysis (Rodrigues Alves et al., 2018; Nomoto, 2018), and two others with the construction and exploitation of a large-scale corpus of papers (Mariani et al., 2019a,b).
