Keywords, Morpheme Parsing, and Syntactic Trees: Features for Text Complexity Assessment

Published: 01 Dec 2025, Last Modified: 24 May 2026 | Venue: Automatic Control and Computer Sciences | Readers: Everyone | License: CC BY-SA 4.0
Abstract: Assessing the complexity of a text is an applied problem with potential uses in drafting legal documents, editing textbooks, and selecting books for extracurricular reading. Methods for building a feature vector for automatic text complexity assessment are quite diverse. Early approaches relied on easily calculated quantities, such as the average sentence length or the average number of syllables per word. As natural language processing algorithms have developed, the space of usable features has expanded. In this study, we examine three groups of features: (1) automatically generated keywords, (2) features derived from morphemic word parsing, and (3) features describing the diversity, branching, and depth of syntactic trees. The RuTermExtract algorithm is used to generate keywords, a convolutional neural network model is used to generate morphemic parses, and the Stanza model, trained on the SynTagRus corpus, is used to generate syntactic trees. We compare the feature groups using four different machine learning algorithms and four annotated Russian-language text corpora. The corpora differ in both domain and annotation paradigm, so the results more objectively reflect the actual relationship between the features and text complexity. On average, keywords perform worse than topic markers obtained using latent Dirichlet allocation (LDA). In most situations, morphemic features turn out to be more effective than previously described methods for assessing the lexical complexity of a text: word frequency and the occurrence of word-formation patterns. In most cases, an extensive set of syntactic features improves the performance of neural network models compared with the previously described feature set.
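To make two of the feature types named in the abstract concrete, the following minimal sketch computes an average sentence length (an "early" surface feature) and the depth and maximum branching of a single dependency tree. It is an illustration only, not the authors' feature extraction: the function names `mean_sentence_length` and `tree_depth_and_branching` are hypothetical, and the tree is assumed to be given as a list of 1-based head indices with 0 marking the root, which is the convention used in CoNLL-U and in Stanza's `word.head` field.

```python
from collections import Counter


def mean_sentence_length(sentences):
    """Average number of tokens per sentence (sentences: list of token lists)."""
    if not sentences:
        return 0.0
    return sum(len(s) for s in sentences) / len(sentences)


def tree_depth_and_branching(heads):
    """Return (depth, max_branching) for one dependency tree.

    heads[i] is the 1-based index of the head of token i + 1; 0 marks the root.
    Depth is the longest path, in edges, from the root to any token.
    Max branching is the largest number of direct dependents of any token.
    Assumes the input is a well-formed tree (no cycles).
    """
    def depth(token):
        steps = 0
        # Walk up the head chain until the root is reached.
        while heads[token - 1] != 0:
            token = heads[token - 1]
            steps += 1
        return steps

    depths = [depth(t) for t in range(1, len(heads) + 1)]
    children = Counter(h for h in heads if h != 0)
    return max(depths, default=0), max(children.values(), default=0)


# Hypothetical parse of "The cat sat on the mat": "sat" (token 3) is the root.
example_heads = [2, 3, 0, 6, 6, 3]
depth, branching = tree_depth_and_branching(example_heads)
```

In practice the head indices would come from a parser's output for each sentence, and per-sentence values would be aggregated (for example, averaged) over the whole text before being placed in the feature vector.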