Rail transit fault text classification based on the latent dirichlet allocation

Ruoqing Li; Shuai Su; Guang Wang; Jia Qu; Yuan Cao

Published: 01 Jan 2021, Last Modified: 05 Aug 2024, Venue: ITSC 2021, License: CC BY-SA 4.0

Abstract: A large amount of text data recorded during rail transit fault diagnosis is currently underused. Designing a sound algorithm to classify fault text is important for unified management and scheduling decisions. However, current text classification methods rarely consider the latent meaning of the text and handle synonyms poorly. In practice, fault text can be regarded as imbalanced, small-sample data, on which the support vector machine (SVM) performs well. Given this, the paper proposes an SVM classification strategy based on latent Dirichlet allocation (LDA) that takes the topic information of the text into account. First, the term frequency-inverse document frequency (TF-IDF) score of each word is used to judge the correlation between words in a document and fault categories. Next, this correlation is incorporated into Gibbs sampling during LDA model training to infer the topic distribution of the text. Finally, using the relationship between topics and fault categories, a separate classifier is set up for each topic by clustering. The model is validated on fault data from the Beijing metro from 2017 to 2020, and the results show that the proposed approach outperforms traditional approaches.
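
As a rough illustration of the first step only, the sketch below scores how strongly each word is associated with a fault category using TF-IDF. It is not the authors' implementation: pooling all records of a category into one "document", the function name `category_tfidf`, and the toy records are all assumptions made for this example, and the abstract does not specify the exact weighting used.

```python
import math
from collections import Counter


def category_tfidf(docs, labels):
    """Return {category: {word: tf-idf}}.

    Assumption for this sketch: all records sharing a label are pooled
    into a single document, so the score reflects word-category association.
    """
    pooled = {}
    for tokens, label in zip(docs, labels):
        pooled.setdefault(label, Counter()).update(tokens)

    n_categories = len(pooled)
    # Number of categories in which each word appears at least once.
    df = Counter(word for counts in pooled.values() for word in counts)

    scores = {}
    for label, counts in pooled.items():
        total = sum(counts.values())
        scores[label] = {
            word: (count / total) * math.log(n_categories / df[word])
            for word, count in counts.items()
        }
    return scores


# Toy, made-up fault records (not taken from the Beijing metro data).
docs = [
    ["door", "fails", "to", "close"],
    ["door", "sensor", "fault"],
    ["signal", "lost", "fault"],
    ["signal", "relay", "fails"],
]
labels = ["door", "door", "signal", "signal"]
scores = category_tfidf(docs, labels)
```

In this toy data, words that occur in both categories (such as "fault") receive a score of zero, while category-specific words score higher. How such scores are then combined with Gibbs sampling is described only at a high level in the abstract and is not reproduced here.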
