MicroBERT-MT: Machine Translation as an Auxiliary Pretraining Task for Low-resource Monolingual Encoders
Keywords: low-resource NLP, machine translation, language model pretraining, auxiliary tasks
TL;DR: Investigating machine translation as an auxiliary pretraining task for improving low-resource monolingual encoders.
Abstract: Transformer-based language models \cite{devlin2019bert,vaswani2017attention} are central to modern NLP, but their performance depends heavily on large-scale pretraining data \cite{kaplan2020scaling,hoffmann2022training}.
This poses challenges for low-resource languages, which are severely underrepresented in existing corpora \cite{joshi-etal-2020-state}, and may not be adequately served by multilingual models.
\citet{gessler-zeldes-2022-microbert} showed that for languages with fewer than 10M tokens of pretraining data, fine-tuning multilingual models usually yields poor results.
Instead, they found that training a monolingual BERT far smaller than usual ($\approx$1M parameters, versus BERT base's 110M) yields much better performance.
Moreover, they showed that adding two auxiliary supervised tasks during pretraining, part-of-speech tagging and Universal Dependencies parsing, could improve model quality further.
In this work, we extend the MicroBERT approach, investigating whether machine translation (MT) can serve as a useful auxiliary task during pretraining of small monolingual BERTs under severe data constraints.
At least a small amount of parallel data with English or another language of broad communication is typically available even for low-resource languages.
We hypothesize that by treating our encoder as (in part) an MT encoder and performing both masked language modeling (MLM) and MT, we can obtain a higher-quality pretrained encoder, especially if the MT decoder is pretrained.
We investigate both randomly initialized and pretrained MT decoders and evaluate the pretrained encoder on two downstream tasks: named entity recognition and Universal Dependencies parsing.
We evaluate our approach on six low-resource languages: Wolof, Coptic, Maltese, Tamil, Indonesian, and Uyghur.
For each language, we use the same data as \citet{gessler-zeldes-2022-microbert} and additionally incorporate parallel English data drawn from several established resources \cite{burchell-etal-2025-expanded, tiedemann-2012-parallel, schwenk-etal-2021-ccmatrix, fan2021beyond}.
Our preliminary results indicate that MT-based pretraining outperforms MLM-only pretraining in most cases, with some languages benefiting much more than others.
We discuss what these results suggest about the utility of parallel corpora as a source of pretraining signal for low-resource languages.
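One way to write the joint MLM and MT objective described above is sketched here for illustration only. The notation and the weighting term $\lambda$ are assumptions introduced for this sketch and are not specified in the abstract: $\theta_{\mathrm{enc}}$ denotes the encoder parameters, $\theta_{\mathrm{dec}}$ the MT decoder parameters (randomly initialized or pretrained), $x$ a source sentence in the low-resource language, and $y$ its English translation.

$$
\mathcal{L}(\theta_{\mathrm{enc}}, \theta_{\mathrm{dec}}) = \mathcal{L}_{\mathrm{MLM}}(\theta_{\mathrm{enc}}) + \lambda \, \mathcal{L}_{\mathrm{MT}}(\theta_{\mathrm{enc}}, \theta_{\mathrm{dec}}),
\qquad
\mathcal{L}_{\mathrm{MT}} = -\sum_{t=1}^{|y|} \log p\left(y_t \mid y_{<t}, \mathrm{Enc}_{\theta_{\mathrm{enc}}}(x); \theta_{\mathrm{dec}}\right)
$$

Under this reading, only the encoder is kept for the downstream tasks, and the MT term acts purely as an additional source of training signal for $\theta_{\mathrm{enc}}$.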
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 58