Frustratingly Simple Regularization to Improve Zero-shot Cross-lingual Robustness

Anonymous

16 Jan 2022 (modified: 05 May 2023), ACL ARR 2022 January Blind Submission
Abstract: Large-scale multilingual pretrained encoders, such as mBERT and XLM-R, have demonstrated impressive zero-shot cross-lingual transfer capability across multiple NLP tasks. However, as we show in this paper, these models suffer from two major problems: (1) degradation of zero-shot cross-lingual performance after fine-tuning on a single language, and (2) sensitivity of cross-lingual performance to fine-tuning hyperparameters. To address these issues, we evaluate two techniques during fine-tuning, Elastic Weight Consolidation (EWC) and L2-distance regularization, which help the multilingual models retain their cross-lingual ability after being fine-tuned on a single language. We compare the zero-shot cross-lingual performance of mBERT with and without regularization on four tasks: XNLI, PANX, UDPOS, and PAWSX, and demonstrate that the model fine-tuned with L2-distance regularization outperforms its vanilla fine-tuned counterpart in the zero-shot setting across all tasks by up to 1.64%. Moreover, by fine-tuning mBERT with different hyperparameter settings on these tasks, we demonstrate that L2-distance regularization also makes fine-tuning more robust, reducing the standard deviation of zero-shot results by up to 87%. Based on our experiments, EWC does not provide consistent improvements across languages. Finally, to test whether additional constraints on the encoder parameters would improve results further, we compare L2-distance regularization with techniques that freeze most of the encoder parameters during fine-tuning, such as BitFit, soft prompting, and adapter-based methods; we observe that L2-distance regularization still performs best.
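A minimal sketch of the L2-distance regularization described in the abstract, written in PyTorch. This is an illustration under stated assumptions rather than the paper's implementation: the helper names (snapshot_pretrained, l2_distance_penalty, fine_tune_step), the task-loss callback, and the coefficient value reg_lambda are hypothetical. The idea is simply to add the squared L2 distance between the current encoder weights and their pretrained values to the task loss during fine-tuning.

```python
import torch


def snapshot_pretrained(model):
    """Copy the pretrained weights once, before fine-tuning starts."""
    return {name: p.detach().clone() for name, p in model.named_parameters()}


def l2_distance_penalty(model, pretrained):
    """Squared L2 distance between current and pretrained parameters."""
    penalty = torch.zeros((), device=next(model.parameters()).device)
    for name, p in model.named_parameters():
        if name in pretrained:
            penalty = penalty + torch.sum((p - pretrained[name]) ** 2)
    return penalty


def fine_tune_step(model, batch, optimizer, task_loss_fn, pretrained, reg_lambda=0.1):
    """One step on the regularized objective: task loss + lambda * ||theta - theta_0||^2.

    `task_loss_fn` is assumed to compute the downstream loss (e.g. cross-entropy
    for XNLI); `reg_lambda` is an illustrative coefficient, not a value from the paper.
    """
    optimizer.zero_grad()
    loss = task_loss_fn(model, batch) + reg_lambda * l2_distance_penalty(model, pretrained)
    loss.backward()
    optimizer.step()
    return loss.item()
```

Penalizing the distance to the pretrained weights keeps the fine-tuned encoder close to its multilingual initialization, which is what the abstract credits for the retained zero-shot transfer after single-language fine-tuning.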
Paper Type: long