Creating Corpus for Georgian Language Modelling

Anonymous

16 Feb 2024, ACL ARR 2024 February Blind Submission, Readers: Everyone
Abstract: The effectiveness of modern NLP methods remains contingent upon the availability of extensive, diverse, high-quality training datasets. This poses a significant challenge for low-resource languages, among which Georgian stands out as not only low-resource but also remarkably under-researched. In this paper, we address one essential element of this problem: the absence of well-organized, openly accessible resources for Georgian language modeling. In particular, we introduce a software framework for collecting, cleaning, and organizing data for Georgian LLM training. We also publish an initial 37 GB version of the dataset, laying the groundwork for subsequent research in this domain.
Paper Type: short
Research Area: Resources and Evaluation
Contribution Types: Publicly available software and/or pre-trained models, Data resources, Data analysis
Languages Studied: Georgian