Abstract: Training large language models is challenging when data availability is limited, as is the case for low-resource languages. We investigate data augmentation techniques for training models on Luxembourgish, a low-resource language. We leverage several word substitution methods to artificially increase the amount of textual data: synonym replacement, entity replacement, and modal verb replacement. We present DA BERT and LuxemBERT-v2, two BERT models for the Luxembourgish language. We evaluate our models on several downstream tasks and conduct an ablation study to assess the impact of each replacement method. Our work provides valuable insights and highlights the importance of finding solutions for training models in low-resource settings.
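As an illustration of the word substitution idea, here is a minimal sketch of dictionary-based synonym replacement. The `SYNONYMS` table, the example words, and the replacement probability `p` are hypothetical placeholders for this sketch, not the lexical resources or settings used in the paper.

```python
import random

# Hypothetical synonym dictionary; a real setup would draw on a
# lexical resource for the target language (here: Luxembourgish).
SYNONYMS = {
    "big": ["large", "huge"],
    "nice": ["pleasant", "lovely"],
}

def augment(sentence: str, p: float = 0.1, seed: int = 0) -> str:
    """Return a copy of `sentence` where each word with a known synonym
    is replaced by a randomly chosen synonym with probability `p`."""
    rng = random.Random(seed)
    out = []
    for tok in sentence.split():
        if tok.lower() in SYNONYMS and rng.random() < p:
            out.append(rng.choice(SYNONYMS[tok.lower()]))
        else:
            out.append(tok)
    return " ".join(out)

if __name__ == "__main__":
    # With p=1.0 every eligible word is substituted, yielding a new
    # training sentence from the original one.
    print(augment("the big dog is nice", p=1.0, seed=42))
```

Entity and modal verb replacement follow the same pattern, differing only in which word classes are targeted for substitution.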