Abstract: The scarcity of non-English data in specialized fields greatly hinders the creation of Natural Language Processing (NLP) tools useful to professionals. This paper introduces TransBERT, a novel framework for pre-training a Language Model (LM) using solely synthetically translated text. The study focuses on French in the life sciences domain to evaluate the effectiveness of this approach, and the evaluation follows a comprehensive statistical methodology based on an existing Domain-Specific (DS) benchmark. A Pre-trained Language Model (PLM) and a tokenizer were trained on a 36.4GB synthetically translated corpus of raw text comprising 22M translated PubMed abstracts. The resulting model addresses the shortage of DS PLMs for non-English languages and outperforms previous State-of-the-Art (SOTA) models with statistical significance across various downstream tasks, suggesting a new SOTA for multilingual and DS NLP. The framework's modular architecture further makes it possible to isolate the impact of DS tokenizers on tasks such as Named Entity Recognition (NER). The results, corpus, code, and models are publicly available to encourage further research in this area.
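Since the abstract summarizes a two-stage pipeline, training a domain-specific tokenizer and then pre-training a masked language model on the synthetically translated corpus, a minimal sketch of such a workflow is given below. It assumes a HuggingFace-style toolchain; the file names, vocabulary size, and hyperparameters are hypothetical placeholders and do not represent the authors' actual TransBERT implementation or settings.

```python
# Sketch: train a DS WordPiece tokenizer, then pre-train a BERT-style model
# with masked language modeling (MLM) on the translated corpus.
# "translated_pubmed_fr.txt" and all hyperparameters are hypothetical.
from tokenizers import BertWordPieceTokenizer
from transformers import (
    BertConfig,
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

# 1) Train a domain-specific WordPiece tokenizer on the translated corpus.
wp_trainer = BertWordPieceTokenizer(lowercase=True)
wp_trainer.train(files=["translated_pubmed_fr.txt"], vocab_size=32_000)
wp_trainer.save_model(".")  # writes vocab.txt to the current directory
tokenizer = BertTokenizerFast(vocab_file="vocab.txt")

# 2) Tokenize the corpus and pre-train a BERT-style model with MLM.
dataset = load_dataset("text", data_files={"train": "translated_pubmed_fr.txt"})["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

model = BertForMaskedLM(BertConfig(vocab_size=tokenizer.vocab_size))
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="transbert-fr-biomed", per_device_train_batch_size=8),
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()
```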
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: Language Modeling, Machine Learning for NLP, Machine Translation, Multilingualism and Cross-Lingual NLP, NLP Applications, Resources and Evaluation
Contribution Types: NLP engineering experiment, Publicly available software and/or pre-trained models, Data resources
Languages Studied: English, French
Submission Number: 7144