LORuGEC: the Linguistically Oriented Rule-annotated corpus for Grammatical Error Correction of Russian

ACL ARR 2025 February Submission 1143 Authors

12 Feb 2025 (modified: 09 May 2025), ACL ARR 2025 February Submission, CC BY 4.0
Abstract: We release LORuGEC, the first rule-annotated corpus for Russian grammatical error correction. Its sentences are accompanied by the grammar rules governing their spelling. In total, we collected 48 rules, with 348 sentences for validation and 612 for testing. LORuGEC proves challenging for open-source LLMs: the best F0.5 score, achieved by Qwen2.5-7B, is only 44%. The closed YandexGPT4 Pro model achieves a score of 73%. Using a rule-informed retriever for few-shot example selection, we improve these scores to 56% for Qwen and 80% for YandexGPT.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: corpus creation; benchmarking; language resources; Grammar and knowledge-based approaches
Contribution Types: Data resources
Languages Studied: Russian
Submission Number: 1143