Data Scaling Laws in NMT: The Effect of Noise and Architecture

Yamini Bansal; Behrooz Ghorbani; Ankush Garg; Biao Zhang; Colin Cherry; Maxim Krikun; Behnam Neyshabur; Orhan Firat

Published: 28 Jan 2022, Last Modified: 13 Feb 2023. ICLR 2022 Submitted. Readers: Everyone

Keywords: Scaling laws, Neural Machine Translation

Abstract: In this work, we empirically study the data scaling properties of neural machine translation (NMT). We first establish that the test loss of encoder-decoder transformer models scales as a power law in the number of training samples, with a dependence on the model size. We then systematically vary aspects of the training setup to understand how they affect the data scaling laws. In particular, we change (1) the architecture and task setup, to a Transformer-LSTM hybrid as well as a decoder-only transformer with a language modeling loss, and (2) the noise level in the training distribution, starting with noisy data with filtering applied as well as clean data corrupted with synthetic i.i.d. noise. In all the above cases, we find that the data scaling exponents are minimally impacted, suggesting that marginally worse architectures or training data quality can be compensated for by adding more data. Lastly, we find that changing the training distribution to use back-translated data instead of parallel data can impact the scaling exponent.

One-sentence Summary: We study the effect of changing architecture and training distribution noise levels on the data scaling laws for NMT.
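
To make the "power law in the number of training samples" claim from the abstract concrete, the following is a minimal sketch of how a data scaling exponent could be estimated from (dataset size, test loss) pairs. It assumes the simplest possible form, loss = a * D^(-p), with no irreducible-loss term and no model-size dependence; the abstract does not give the exact functional form the authors fit, so this is an illustration, not their method. The function name `fit_power_law` and all numbers below are hypothetical and synthetic, not results from the paper.

```python
import numpy as np


def fit_power_law(num_samples, test_loss):
    """Fit loss ~ a * D**(-p) by ordinary least squares in log-log space.

    Assumed form for illustration only: a pure power law without an
    irreducible-loss offset or a model-size term.
    """
    # log(loss) = log(a) - p * log(D), so a straight-line fit gives both.
    slope, intercept = np.polyfit(np.log(num_samples), np.log(test_loss), deg=1)
    return float(np.exp(intercept)), float(-slope)


# Synthetic, made-up data (NOT from the paper): a known power law with
# a small amount of multiplicative noise on the loss.
rng = np.random.default_rng(0)
dataset_sizes = np.logspace(5, 8, num=10)  # 1e5 to 1e8 training samples
true_a, true_p = 20.0, 0.3
noise = np.exp(rng.normal(0.0, 0.01, size=dataset_sizes.shape))
losses = true_a * dataset_sizes ** (-true_p) * noise

a_hat, p_hat = fit_power_law(dataset_sizes, losses)
print(f"fitted exponent p = {p_hat:.3f}, prefactor a = {a_hat:.2f}")
```

Under this assumed form, comparing two training setups amounts to comparing their fitted `p_hat` values: a similar exponent with a different prefactor would mean the weaker setup can be matched by training on more data, which is the kind of comparison the abstract describes.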
