Text Similarity Detection Using Machine Learning Algorithms with Character-Based Similarity Measures

Emil Kalbaliyev, Samir Rustamov

Published: 01 Jan 2020, Last Modified: 10 Oct 2024. Venue: MIDI 2020. License: CC BY-SA 4.0

Abstract: Text similarity detection is a significant research problem in Natural Language Processing. In this paper, we propose an approach that uses machine learning models with seven character-based similarity measures to classify text pairs as similar or non-similar. The seven measures (Longest Common Substring, Longest Common Subsequence, the Ratcliff/Obershelp algorithm, Jaro, Jaro–Winkler, Levenshtein, and Damerau-Levenshtein distances) serve as input features for supervised machine learning algorithms. For the similarity detection task, we collected news articles from Azerbaijani news websites, created 9600 text pairs, and manually labeled each pair as similar or non-similar. Each text pair is processed by the similarity measures, and the resulting scores are fed to three machine learning algorithms: Support Vector Machine, Random Forest, and Multi-layer Perceptron Neural Network. We performed 10-fold cross-validation on the dataset and found that the trained Multi-layer Perceptron Neural Network gives the best mean accuracy (96%) in detecting similarity between two text bodies. We demonstrate that the proposed method outperforms the results obtained from individual character-based similarity measures.
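
As an illustration of the feature-extraction step described in the abstract, the following is a minimal sketch and not the authors' implementation. It covers only four of the seven measures (Longest Common Substring, Longest Common Subsequence, a Ratcliff/Obershelp-style ratio, and Levenshtein distance); Jaro, Jaro–Winkler, and Damerau-Levenshtein are omitted for brevity. The function names, the normalization by the longer text's length, and the use of Python's `difflib.SequenceMatcher` as a stand-in for Ratcliff/Obershelp are all assumptions made for this example.

```python
from difflib import SequenceMatcher
from typing import List


def longest_common_substring(a: str, b: str) -> int:
    """Length of the longest contiguous character run shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def longest_common_subsequence(a: str, b: str) -> int:
    """Length of the longest (not necessarily contiguous) shared subsequence."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr[j] = prev[j - 1] + 1
            else:
                curr[j] = max(prev[j], curr[j - 1])
        prev = curr
    return prev[len(b)]


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i] + [0] * len(b)
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost)
        prev = curr
    return prev[len(b)]


def similarity_features(a: str, b: str) -> List[float]:
    """Feature vector for one text pair; each value is scaled to [0, 1].

    Assumption: scores are normalized by the longer text's length.
    The abstract does not specify a normalization scheme.
    """
    longest = max(len(a), len(b)) or 1  # avoid division by zero
    return [
        longest_common_substring(a, b) / longest,
        longest_common_subsequence(a, b) / longest,
        # difflib's ratio is based on Ratcliff/Obershelp-style matching;
        # autojunk is disabled so long texts are not heuristically pruned.
        SequenceMatcher(None, a, b, autojunk=False).ratio(),
        1.0 - levenshtein(a, b) / longest,
    ]
```

Under these assumptions, one such vector would be computed per labeled text pair, and the resulting matrix of vectors and labels would then be passed to a supervised classifier and evaluated with 10-fold cross-validation.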