Sequence-to-Sequence Multilingual Pre-Trained Models: A Hope for Low-Resource Language Translation?

Anonymous

16 Nov 2021 (modified: 05 May 2023) · ACL ARR 2021 November Blind Submission
Abstract: We investigate the capability of mBART, a sequence-to-sequence multilingual pre-trained model, in translating low-resource languages under five factors: the amount of data used to pre-train the original model, the amount of data used in fine-tuning, the noisiness of the fine-tuning data, the domain relatedness between the pre-training, fine-tuning, and test datasets, and the relatedness of the languages involved. When only limited parallel corpora are available, fine-tuning mBART measurably improves translation performance over training a Transformer from scratch. mBART also makes effective use of domain-mismatched text, suggesting that it can learn meaningful representations when data is scarce. However, it still struggles when only a small amount of data in languages unseen during pre-training is provided.
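To make the fine-tuning setup concrete, below is a minimal sketch of fine-tuning mBART for translation using the Hugging Face Transformers checkpoint facebook/mbart-large-cc25. The language pair (English to Romanian), the toy parallel data, and the hyperparameters are illustrative assumptions, not the paper's actual experimental configuration.

```python
# Minimal sketch: fine-tuning mBART for translation with Hugging Face Transformers.
# The language pair (en -> ro), toy parallel data, and hyperparameters below are
# illustrative assumptions, not the paper's experimental setup.
import torch
from transformers import MBartForConditionalGeneration, MBartTokenizer

tokenizer = MBartTokenizer.from_pretrained(
    "facebook/mbart-large-cc25", src_lang="en_XX", tgt_lang="ro_RO"
)
model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-cc25")

# Tiny parallel corpus standing in for a low-resource fine-tuning set.
src_texts = ["The cat sat on the mat.", "Where is the train station?"]
tgt_texts = ["Pisica a stat pe covor.", "Unde este gara?"]

# Tokenize source and target; passing text_target adds `labels` for the seq2seq loss.
batch = tokenizer(
    src_texts, text_target=tgt_texts,
    padding=True, truncation=True, max_length=128, return_tensors="pt",
)

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
model.train()
for step in range(3):  # a few illustrative gradient steps
    outputs = model(**batch)   # labels in `batch` yield a cross-entropy loss
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Translate with the fine-tuned model; mBART-cc25 expects the target language
# code as the decoder start token.
model.eval()
inputs = tokenizer(["The weather is nice today."], return_tensors="pt")
generated = model.generate(
    **inputs,
    decoder_start_token_id=tokenizer.lang_code_to_id["ro_RO"],
    max_length=64,
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```

In the paper's low-resource setting, the toy lists above would be replaced by the available parallel corpus for the language pair under study; the comparison baseline is a Transformer trained from scratch on the same data.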