Abstract: The effectiveness of modern NLP methods remains contingent on the availability of extensive, diverse, high-quality training datasets. This poses a significant challenge for low-resource languages, among which Georgian stands out as not only low-resource but also remarkably under-researched. In this paper, we address one essential element of this problem: the absence of well-organized, openly accessible resources for Georgian language modeling. In particular, we introduce a software framework for collecting, cleaning, and organizing data for Georgian LLM training. We also publish an initial 37GB version of the dataset, laying the groundwork for subsequent research in this domain.
Paper Type: short
Research Area: Resources and Evaluation
Contribution Types: Publicly available software and/or pre-trained models, Data resources, Data analysis
Languages Studied: Georgian