Keywords: Scaling laws, Neural Machine Translation
Abstract: In this work, we empirically study the data scaling properties of neural machine translation (NMT). We first establish that the test loss of encoder-decoder transformer models scales as a power law in the number of training samples, with a dependence on the model size. We then systematically vary various aspects of the training setup to understand how they impact the data scaling laws. In particular, we change the (1) Architecture and task setup, to a Transformer-LSTM Hybrid as well as a Decoder-only transformer with language modeling loss (2) Noise level in the training distribution, starting with noisy data with filtering applied as well as clean data corrupted with synthetic iid noise. In all the above cases, we find that the data scaling exponents are minimally impacted, suggesting that marginally worse architectures or training data quality can be compensated for by adding more data. Lastly, we find that changing the training distribution to use back-translated data instead of parallel data, can impact the scaling exponent.
One-sentence Summary: We study the effect of changing architecture and training distribution noise levels on the data scaling laws for NMT.