Transformer Architecture Search for Improving Out-of-Domain Generalization in Machine Translation

TMLR Paper3061 Authors

24 Jul 2024 (modified: 17 Nov 2024) · Decision pending for TMLR · CC BY 4.0
Abstract: Interest in automatically searching for Transformer neural architectures for machine translation (MT) has been growing. Current methods show promising results in in-domain settings, where the training and test data share the same distribution. In real-world MT applications, however, the test data often follows a different distribution from the training data. In these out-of-domain (OOD) situations, Transformer architectures optimized for the linguistic characteristics of the training sentences struggle to produce accurate translations for OOD sentences at test time. To tackle this issue, we propose a multi-level optimization-based method that automatically searches for neural architectures with robust OOD generalization capabilities. During the architecture search process, our method automatically synthesizes high-fidelity OOD MT data, which is used to evaluate and improve the architectures' ability to generalize to OOD scenarios. The generation of OOD data and the search for optimal architectures are executed in an integrated, end-to-end manner. Evaluated across multiple datasets, our method demonstrates strong OOD generalization performance, surpassing state-of-the-art approaches.
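To make the abstract's high-level description concrete, below is a minimal, hypothetical sketch (not the authors' code) of one common way to approximate such a multi-level optimization: alternating single-step updates over three sets of parameters. Model weights are trained on in-domain data, a generator perturbs inputs to synthesize OOD-like data under a fidelity penalty, and DARTS-style architecture parameters are updated on the synthesized data. The `SuperNet`, `OODGenerator`, loss, learning rates, and random toy batches are all illustrative stand-ins, not the paper's actual MT models, objectives, or data.

```python
# Hypothetical sketch of a three-level alternating optimization loop in PyTorch.
# All module names, losses, and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

DIM, N_OPS = 32, 3  # toy embedding size and number of candidate ops


class SuperNet(nn.Module):
    """DARTS-style supernet: mixes candidate ops with softmaxed arch weights."""
    def __init__(self):
        super().__init__()
        self.ops = nn.ModuleList([nn.Linear(DIM, DIM) for _ in range(N_OPS)])
        self.alpha = nn.Parameter(torch.zeros(N_OPS))  # architecture parameters

    def forward(self, x):
        w = F.softmax(self.alpha, dim=0)
        return sum(wi * op(x) for wi, op in zip(w, self.ops))


class OODGenerator(nn.Module):
    """Perturbs in-domain embeddings to synthesize OOD-like inputs."""
    def __init__(self):
        super().__init__()
        self.shift = nn.Linear(DIM, DIM)

    def forward(self, x):
        return x + self.shift(x)


net, gen = SuperNet(), OODGenerator()
w_opt = torch.optim.SGD(net.ops.parameters(), lr=1e-2)  # level 1: model weights
a_opt = torch.optim.Adam([net.alpha], lr=3e-3)          # level 2: architecture
g_opt = torch.optim.Adam(gen.parameters(), lr=3e-3)     # level 3: OOD generator


def loss_fn(pred, target):
    return F.mse_loss(pred, target)  # placeholder for the actual MT loss


for step in range(100):
    x = torch.randn(8, DIM)  # stand-in for an in-domain batch
    y = torch.randn(8, DIM)  # stand-in for reference targets

    # Level 1: train model weights on in-domain data.
    w_opt.zero_grad()
    loss_fn(net(x), y).backward()
    w_opt.step()

    # Level 3: update the generator so synthesized data is hard for the
    # current model but stays close to the in-domain batch (fidelity term).
    g_opt.zero_grad()
    x_ood = gen(x)
    (-loss_fn(net(x_ood), y) + 0.1 * F.mse_loss(x_ood, x)).backward()
    g_opt.step()

    # Level 2: update architecture parameters on the synthesized OOD data.
    a_opt.zero_grad()
    loss_fn(net(gen(x).detach()), y).backward()
    a_opt.step()
```

The alternating one-step updates above are a standard approximation to nested optimization problems of this kind; the paper's exact formulation, schedules, and fidelity constraints may differ.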
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: We have revised the related work section to highlight the relationships and distinctions between existing methods and our approach, and incorporated a discussion of related work on noisy embeddings in Section 2.3. To keep the main text concise, we have moved the discussion of RL-based and EA-based NAS methods to Appendix A. We have added a separate discussion section (Section 5) covering in-domain generalization performance, model size, and search cost, together with an analysis of the searched model architecture. Furthermore, we have included Figure 4 to illustrate the search results of the baseline methods, with a detailed analysis in Appendix F. In Appendix G, we provide additional analysis of the generated OOD samples, showing the distribution of their local angles, visualized in Figure 5. An ablation study investigating the impact of neural architecture search has been included in Appendix H, with results in Table 10. Finally, we have run further experiments evaluating the out-of-domain generalization performance of our method on high-resource languages, discussed in Appendix I, with results in Table 11.
Assigned Action Editor: ~Bamdev_Mishra1
Submission Number: 3061