Cross-Lingual Plagiarism Detection: Two Are Better Than One

K. Avetisyan, German Gritsay, Andrey V. Grabovoy

Published: 2023, Last Modified: 15 Oct 2024. Program. Comput. Softw. 2023. CC BY-SA 4.0

Abstract: The widespread availability of scientific documents in multiple languages, coupled with the development of automatic translation and editing tools, has created a demand for efficient methods that can detect plagiarism across different languages. In this paper, we present a novel cross-lingual plagiarism detection approach. The algorithm merges two existing approaches, each of which achieves state-of-the-art (SOTA) or near-SOTA results on different benchmarks. The detailed analysis stages of the two approaches are merged sequentially, which offsets the disadvantages of each. The resulting algorithm significantly outperforms both of the approaches it was built from, surpassing them by 23 to 33% in Plagdet score, depending on the language pair. The approaches were compared on a newly generated multilingual (English, Russian, Spanish, Armenian) test collection, in which each suspicious document may contain fragments plagiarised from sources in several languages. The merged method is applicable to various under-resourced languages, which is demonstrated on the example of Armenian.
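
For readers unfamiliar with the Plagdet score mentioned above, the following is a minimal sketch of how it is commonly defined in the PAN plagiarism detection evaluations: the F1 score of precision and recall, divided by log2(1 + granularity). The abstract does not state which variant of precision, recall, or granularity the authors compute, so the function name `plagdet` and its inputs are assumptions made for illustration only.

```python
import math


def plagdet(precision: float, recall: float, granularity: float) -> float:
    """Plagdet as commonly defined in PAN: F1 / log2(1 + granularity).

    Assumption: precision and recall are already computed at the
    character level, and granularity >= 1 is the average number of
    detections reported per correctly detected plagiarism case.
    """
    if granularity < 1:
        raise ValueError("granularity must be at least 1")
    if precision + recall == 0:
        return 0.0
    f1 = 2 * precision * recall / (precision + recall)
    # A granularity of 1 (one detection per case) leaves F1 unchanged;
    # fragmented detections (granularity > 1) are penalised.
    return f1 / math.log2(1 + granularity)
```

With a granularity of 1 the score reduces to plain F1, so differences in Plagdet between methods then reflect precision and recall alone.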