Overcoming Language Barriers in Classification through Translation-Augmented Data

Noémi Veres, Vlad-Andrei Negru, Sebastian-Antonio Toma, Camelia Lemnaru, Rodica Potolea

Published: 2024, Last Modified: 26 May 2026, Venue: ICCP 2024, License: CC BY-SA 4.0

Abstract: This paper addresses the challenge of adapting a multilingual language model to an unseen language by augmenting the training dataset with translations. We evaluate how fine-tuning a model on Hungarian data obtained through translation affects performance on several classification tasks. Our experiments demonstrate that translating recipes into Hungarian and using them for training significantly improves the model's accuracy, with the best results achieved using only 35% of the available Hungarian recipes. Additionally, fine-tuning on multilingual data before incorporating Hungarian yields better performance than fine-tuning directly on Hungarian alone. Overall, our approach increases accuracy by 9.37% across three classification tasks. These findings highlight the effectiveness of translations and multilingual models for improving performance, particularly in low-resource languages, where collecting labeled training data is challenging and expensive.

External IDs: dblp:conf/iccp2/VeresNTLP24