Abstract: The scarcity of non-English data in specialized fields greatly hinders the creation of Natural Language Processing (NLP) tools useful to professionals. This paper introduces TransBERT, a novel framework for pre-training a Language Model (LM) using solely synthetically translated text. The study focuses on French in the life sciences domain to evaluate the effectiveness of this approach, and the evaluation follows a comprehensive statistical methodology based on an existing Domain-Specific (DS) benchmark. A Pre-trained Language Model (PLM) and a tokenizer were trained on a 36.4GB synthetically translated corpus of raw text comprising 22M translated PubMed abstracts. The resulting model addresses the shortage of DS PLMs for non-English languages and outperforms previous State-of-the-Art (SOTA) models with statistical significance across various downstream tasks, suggesting a new SOTA for multilingual and DS NLP. The framework's modular architecture further makes it possible to isolate the impact of DS tokenizers on tasks such as Named Entity Recognition (NER). The results, corpus, code, and models are publicly available to encourage further research in this area.
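Since the abstract summarizes a two-stage pipeline, training a domain-specific tokenizer and then pre-training a masked language model on the synthetically translated corpus, a minimal sketch of such a workflow is given below. It assumes a HuggingFace-style toolchain; the file names, vocabulary size, and hyperparameters are hypothetical placeholders and do not represent the authors' actual TransBERT implementation or settings.

```python
# Sketch: train a DS WordPiece tokenizer, then pre-train a BERT-style model
# with masked language modeling (MLM) on the translated corpus.
# "translated_pubmed_fr.txt" and all hyperparameters are hypothetical.
from tokenizers import BertWordPieceTokenizer
from transformers import (
    BertConfig,
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

# 1) Train a domain-specific WordPiece tokenizer on the translated corpus.
wp_trainer = BertWordPieceTokenizer(lowercase=True)
wp_trainer.train(files=["translated_pubmed_fr.txt"], vocab_size=32_000)
wp_trainer.save_model(".")  # writes vocab.txt to the current directory
tokenizer = BertTokenizerFast(vocab_file="vocab.txt")

# 2) Tokenize the corpus and pre-train a BERT-style model with MLM.
dataset = load_dataset("text", data_files={"train": "translated_pubmed_fr.txt"})["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

model = BertForMaskedLM(BertConfig(vocab_size=tokenizer.vocab_size))
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="transbert-fr-biomed", per_device_train_batch_size=8),
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()
```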
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: Language Modeling, Machine Learning for NLP, Machine Translation, Multilingualism and Cross-Lingual NLP, NLP Applications, Resources and Evaluation
Contribution Types: NLP engineering experiment, Publicly available software and/or pre-trained models, Data resources
Languages Studied: English, French
Submission Number: 7144