Transformer Architecture Search for Improving Out-of-Domain Generalization in Machine Translation

Published: 26 Dec 2024 · Last Modified: 26 Dec 2024 · Accepted by TMLR · License: CC BY 4.0
Abstract: Interest in automatically searching for Transformer neural architectures for machine translation (MT) has been increasing. Current methods show promising results in in-domain settings, where training and test data share the same distribution. In real-world MT applications, however, the test data often has a different distribution from the training data. In these out-of-domain (OOD) situations, Transformer architectures optimized for the linguistic characteristics of the training sentences struggle to produce accurate translations for OOD sentences at test time. To tackle this issue, we propose a multi-level optimization-based method that automatically searches for neural architectures with robust OOD generalization capabilities. During the architecture search, our method automatically synthesizes approximated OOD MT data, which is used to evaluate and improve the architectures' ability to generalize to OOD scenarios. The generation of approximated OOD data and the search for optimal architectures are executed in an integrated, end-to-end manner. Evaluated across multiple datasets, our method demonstrates strong OOD generalization performance, surpassing state-of-the-art approaches.
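As a rough illustration of the multi-level search loop described in the abstract, the hypothetical PyTorch sketch below alternates between (1) training model weights on in-domain data, (2) updating a toy perturbation generator that stands in for the synthesis of approximated OOD data, and (3) updating DARTS-style architecture weights on a validation loss plus a synthetic-OOD loss. The `MixedOp` class, the linear `generator`, the adversarial generator objective, and the random toy batches are all illustrative assumptions, not the authors' implementation; see the linked repository for the actual code.

```python
# Hypothetical sketch of a multi-level NAS loop with synthetic "approximated OOD"
# data. Not the paper's implementation; all module and variable names are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    """DARTS-style continuous relaxation: a softmax-weighted mixture of candidate ops
    (a toy stand-in for a searchable Transformer cell)."""
    def __init__(self, dim):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Linear(dim, dim),                            # candidate op 1
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU()),  # candidate op 2
            nn.Identity(),                                  # candidate op 3 (skip)
        ])
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))  # architecture weights

    def forward(self, x):
        w = F.softmax(self.alpha, dim=0)
        return sum(wi * op(x) for wi, op in zip(w, self.ops))

dim = 16
model = MixedOp(dim)
generator = nn.Linear(dim, dim)  # toy generator of approximated OOD inputs

weight_params = [p for n, p in model.named_parameters() if n != "alpha"]
w_opt = torch.optim.SGD(weight_params, lr=0.1)       # model weights
a_opt = torch.optim.Adam([model.alpha], lr=0.01)     # architecture weights
g_opt = torch.optim.Adam(generator.parameters(), lr=0.01)  # data generator

def loss_fn(x, y):
    return F.mse_loss(model(x), y)

for step in range(100):
    x_tr, y_tr = torch.randn(32, dim), torch.randn(32, dim)    # toy training batch
    x_val, y_val = torch.randn(32, dim), torch.randn(32, dim)  # toy validation batch

    # Stage 1: update model weights on in-domain training data.
    w_opt.zero_grad()
    loss_fn(x_tr, y_tr).backward()
    w_opt.step()

    # Stage 2: update the generator so its outputs stay hard for the current model
    # (one simple proxy for synthesizing approximated OOD data).
    g_opt.zero_grad()
    (-loss_fn(generator(x_val), y_val)).backward()
    g_opt.step()

    # Stage 3: update architecture weights on validation + synthetic-OOD losses,
    # detaching the generator so only the architecture receives this gradient.
    a_opt.zero_grad()
    (loss_fn(x_val, y_val) + loss_fn(generator(x_val).detach(), y_val)).backward()
    a_opt.step()
```

In this sketch the three optimizers play the roles of the three levels: weights adapt to in-domain data, the generator produces approximated OOD examples, and the architecture is selected for performing well on both held-out in-domain and synthetic OOD losses, mirroring the integrated, end-to-end search described above.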
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission:
1. We have incorporated additional results on out-of-domain generalization across multiple new settings, including En-Ga (Flores), En-De (WMT-Chat), En-De (WMT-Biomedical), En-Fr (WMT-Chat), and En-Cs (WMT-Biomedical). These results, covering all five methods (Transformer, DARTS, Ours-darts, PDARTS, Ours-pdarts), have been added to Table 1 in the revised manuscript. Corresponding updates were made to Sections 4.1, 4.4, and Appendix D to describe the new datasets and experimental settings used for these additional results. These additional experiments strengthen the contributions of our paper in out-of-domain generalization.
2. The manuscript has been extensively revised to de-emphasize the role of the generated "out-of-domain" data in the second stage as a primary contribution. Instead, we emphasize that the core contribution of this work lies in the learned architecture's ability to generalize effectively to out-of-domain test datasets, as demonstrated in the experimental results, particularly in Table 1. To achieve this, we generate an additional MT dataset that closely approximates OOD data and use it to train an architecture with robust OOD generalization capabilities. In the revised text, we refer to the generated data interchangeably as "approximated OOD data" or "synthetic OOD data" to clearly distinguish it from genuine OOD data.
3. We have added further discussion in Appendix C and revised Section 3.6 to clarify key properties of our optimization algorithm. In summary, while our optimization algorithm does not guarantee convergence to the globally optimal architecture $A^*$, the approximation $A' \approx A^*$ learned by our algorithm still exhibits strong generalizability. This is supported by both our experimental results and prior literature [1, 2], and it is sufficient within the scope of this paper to achieve superior out-of-domain performance.
[1] Laurent Dinh, Razvan Pascanu, Samy Bengio, Yoshua Bengio. Sharp Minima Can Generalize for Deep Nets. ICML 2017.
[2] Fan Bao, Guoqiang Wu, Chongxuan Li, Jun Zhu, Bo Zhang. Stability and Generalization of Bilevel Programming in Hyperparameter Optimization. NeurIPS 2021.
Code: https://github.com/yihenghe/transformer_nas
Assigned Action Editor: ~Bamdev_Mishra1
Submission Number: 3061