MicroBERT-MT: Machine Translation as an Auxiliary Pretraining Task for Low-resource Monolingual Encoders

Phakphum Artkaew; Luke Gessler

Published: 28 Apr 2026, Last Modified: 28 Apr 2026 | MSLD 2026 Poster | CC BY 4.0

Keywords: low-resource NLP, machine translation, language model pretraining, auxiliary tasks

TL;DR: Investigating machine translation as an auxiliary pretraining task for improving low-resource monolingual encoders.

Abstract: Transformer-based language models \cite{devlin2019bert,vaswani2017attention} are central to modern NLP, but their performance depends heavily on large-scale pretraining data \cite{kaplan2020scaling,hoffmann2022training}. This poses challenges for low-resource languages, which are severely underrepresented in existing corpora \cite{joshi-etal-2020-state} and may not be adequately served by multilingual models. \citet{gessler-zeldes-2022-microbert} demonstrated that for languages with <10M tokens of pretraining data, fine-tuning multilingual models usually yields poor results. Instead, they showed that training a monolingual BERT that is substantially smaller than usual ($\approx$1M parameters, instead of BERT base's 110M) yields much better performance. Moreover, in experiments with two auxiliary supervised pretraining tasks, they showed that pretraining with part-of-speech tagging and Universal Dependencies parsing could enhance model quality further. In this work, we extend the MicroBERT approach, investigating whether machine translation (MT) can serve as a useful auxiliary task during pretraining of small monolingual BERTs under severe data constraints. At least a small amount of parallel data with English or another language of broad communication is typically available even for low-resource languages. We hypothesize that by treating our encoder as (in part) an MT encoder and performing both masked language modeling (MLM) and MT, we can obtain a higher-quality pretrained encoder, especially if the MT decoder is pretrained. We investigate both randomly initialized and pretrained MT decoders and evaluate the pretrained encoder on two downstream tasks: named entity recognition and Universal Dependencies parsing. We evaluate our approach on six low-resource languages: Wolof, Coptic, Maltese, Tamil, Indonesian, and Uyghur.
For each language, we use the same data as \citet{gessler-zeldes-2022-microbert} and additionally incorporate parallel English data drawn from several established resources \cite{burchell-etal-2025-expanded, tiedemann-2012-parallel, schwenk-etal-2021-ccmatrix, fan2021beyond}. Our preliminary results indicate that MT-based pretraining outperforms vanilla MLM in most cases, with some languages benefiting much more than others. We discuss what these results suggest about the utility of parallel corpora as a source of pretraining signal for low-resource languages.

Submission Number: 58
