Classification of text fragments in available anti-plagiarism tools without access to the source file

ACL ARR 2026 January Submission4019 Authors

05 Jan 2026 (modified: 20 Mar 2026), ACL ARR 2026 January Submission, CC BY 4.0
Keywords: diploma theses documents, document layout understanding, text classification
Abstract: As part of the development of the Unified Anti-Plagiarism System (JSA), a Polish nationwide platform for detecting plagiarism in theses and other academic documents, research was conducted to improve the text extraction process. JSA operates solely on text content extracted from documents, without access to the original source files, which rules out multi-modal approaches based on document layout. As a result, a new method was developed that identifies fragment types based on character string analysis alone.
Paper Type: Short
Research Area: NLP Applications
Research Area Keywords: educational applications
Contribution Types: NLP engineering experiment, Approaches to low-resource settings
Languages Studied: Polish
Submission Number: 4019