Abstract: The effectiveness of modern NLP methods remains contingent on the availability of extensive, diverse, high-quality training datasets. This poses a significant challenge for low-resource languages, among which Georgian stands out as not only low-resource but also remarkably under-researched. In this paper, we address one essential element of this problem: the absence of well-organized, openly accessible resources for Georgian language modeling. In particular, we introduce a software framework for collecting, cleaning, and organizing data for Georgian LLM training. We also publish an initial 37GB version of the dataset, laying the groundwork for subsequent research in this domain.
Paper Type: short
Research Area: Resources and Evaluation
Contribution Types: Publicly available software and/or pre-trained models, Data resources, Data analysis
Languages Studied: Georgian