A method for Automatic Text Summarization based on Rhetorical Analysis and Topic Modeling

Tatiana Batura, Aigerim Bakiyeva, Maria Charintseva

Published: 2020, Last Modified: 12 Aug 2024. Int. J. Comput. 2020. CC BY-SA 4.0

Abstract: This article describes an original method for automatic summarization of scientific and technical texts that is based on rhetorical analysis and topic modeling. The proposed method combines a linguistic knowledge base with machine learning. Key terms are detected with topic modeling: first, unigram topic models containing only one-word terms are constructed; these models are then extended by adding multiword terms. The most significant fragments of the original document are determined during rhetorical analysis with the help of discourse markers. When the importance of text fragments is evaluated, keywords, multiword terms, and the scientific lexicon characteristic of scientific and technical texts are also taken into account. A linguistic knowledge base has been created to store information about the markers and the scientific lexicon. The experiments showed that the method is effective, needs a comparatively small amount of training data, and can be adapted to processing texts from other subject fields and in other languages.
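
The abstract gives no formulas, so the following is only a minimal sketch of how the evidence sources it names (discourse markers, keywords, multiword terms, scientific lexicon) could be combined into a sentence score for extractive summarization. The marker list, the lexicon, all weights, and the function names are hypothetical illustrations, not the authors' knowledge base or scoring scheme, and the key terms are passed in by hand rather than produced by a topic model.

```python
import re

# Hypothetical discourse markers with illustrative weights (not from the paper).
MARKERS = {
    "we propose": 3.0,
    "the results show": 3.0,
    "in conclusion": 2.5,
    "for example": -1.0,  # elaboration is assumed to be less summary-worthy
}

# Hypothetical sample of general scientific vocabulary (not from the paper).
SCIENTIFIC_LEXICON = {"method", "experiment", "analysis", "approach"}


def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def score_sentence(sentence, key_terms, multiword_terms):
    """Sum assumed weights for markers, key terms, multiword terms, and lexicon hits."""
    lowered = sentence.lower()
    tokens = re.findall(r"[a-z]+", lowered)
    score = 0.0
    score += sum(w for marker, w in MARKERS.items() if marker in lowered)
    score += sum(1.0 for t in tokens if t in key_terms)
    score += sum(2.0 for term in multiword_terms if term in lowered)
    score += sum(0.5 for t in tokens if t in SCIENTIFIC_LEXICON)
    return score


def summarize(text, key_terms, multiword_terms, n=2):
    """Return the n highest-scoring sentences, kept in their original order."""
    sentences = split_sentences(text)
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: score_sentence(sentences[i], key_terms, multiword_terms),
        reverse=True,
    )
    return [sentences[i] for i in sorted(ranked[:n])]


if __name__ == "__main__":
    sample = (
        "We propose a method for topic modeling. The weather was nice. "
        "For example, cats sleep a lot. The results show that the approach works."
    )
    for line in summarize(sample, {"topic", "summarization"}, {"topic modeling"}):
        print(line)
```

In a fuller pipeline the `key_terms` and `multiword_terms` arguments would come from the unigram and extended topic models described above; here they are supplied manually to keep the sketch self-contained.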