Abstract: In this article, we report the construction of a web-based Galician corpus and its language model, both made publicly available, using CCNet tools and data. We also analyze the corpus in depth to provide insights into how to achieve optimum quality by using heuristics that lower perplexity.