Enhancing Parameter-Efficient Transformers with Contrastive Syntax and Regularized Dropout for Neural Machine Translation

Published: 01 Jan 2024 · Last Modified: 18 May 2025 · PRICAI (3) 2024 · CC BY-SA 4.0
Abstract: Transformers have significantly improved Neural Machine Translation (NMT) models, but carry an inherent space complexity of \(O(n^{2})\). While recent approaches aim to be parameter-efficient, they often exhibit limited generalization with fewer parameters and degraded performance on longer sentences. To this end, we propose two methods for the cross-layer parameter-sharing Transformer (our baseline) in NMT: Syntax-enhanced Contrastive Learning (Syn-CL) and JS divergence-based Regularized Dropout (JSR-Drop). In Syn-CL, we add the corresponding target-to-source instances of the same batch in an extra bidirectional training phase, and minimize the distance between the syntax-enhanced representations of each bilingual sentence pair while maximizing the distances to the other pairs. In JSR-Drop, we refine the regularized dropout strategy with JS divergence, improving generalization by reducing the inconsistency between training and inference that dropout introduces in Transformers. Extensive experiments on six NMT tasks, IWSLT2014 (German\(\leftrightarrow \)English) and IWSLT2017 (English\(\leftrightarrow \)French and English\(\leftrightarrow \)Romanian), show an average BLEU score of 37.55, surpassing the vanilla Transformer and the baseline by up to 1.75 BLEU, and outperforming models with 1.5\(\times \) and 2.0\(\times \) the parameters in translating long sentences. Our code is available at https://github.com/Anon1214/EnPeNMT.
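
The abstract does not spell out the loss formulations, so the following is a minimal PyTorch sketch of the two auxiliary objectives as described: a Jensen–Shannon divergence penalty between two dropout passes of the same batch (JSR-Drop) and an in-batch contrastive loss over paired bilingual sentence representations (Syn-CL). The function names, the `temperature`, `alpha`, and `beta` hyperparameters, and the use of pooled sentence-level representations are illustrative assumptions, not taken from the paper or the linked repository.

```python
# Illustrative sketch only; not the authors' implementation.
import torch
import torch.nn.functional as F


def jsr_drop_loss(logits_a: torch.Tensor, logits_b: torch.Tensor) -> torch.Tensor:
    """Jensen-Shannon divergence between two forward passes of the same batch,
    each drawn with an independent dropout mask (shape: [tokens, vocab])."""
    p = F.softmax(logits_a, dim=-1)
    q = F.softmax(logits_b, dim=-1)
    m = 0.5 * (p + q)                       # mixture distribution
    # JS(p, q) = 0.5 * KL(p || m) + 0.5 * KL(q || m)
    return 0.5 * F.kl_div(m.log(), p, reduction="batchmean") \
         + 0.5 * F.kl_div(m.log(), q, reduction="batchmean")


def syn_cl_loss(src_repr: torch.Tensor, tgt_repr: torch.Tensor,
                temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE-style contrastive loss: pull each (syntax-enhanced) source
    representation toward its paired target representation and push it away
    from the other targets in the batch (shapes: [batch, hidden])."""
    src = F.normalize(src_repr, dim=-1)
    tgt = F.normalize(tgt_repr, dim=-1)
    sim = src @ tgt.t() / temperature       # [batch, batch] similarity matrix
    labels = torch.arange(sim.size(0), device=sim.device)
    return F.cross_entropy(sim, labels)


def total_loss(ce_loss, logits_a, logits_b, src_repr, tgt_repr,
               alpha: float = 1.0, beta: float = 1.0):
    """Hypothetical combined objective: translation cross-entropy plus the two
    auxiliary terms, weighted by illustrative hyperparameters alpha and beta."""
    return ce_loss + alpha * jsr_drop_loss(logits_a, logits_b) \
                   + beta * syn_cl_loss(src_repr, tgt_repr)
```

In such a setup, the batch would typically be forwarded twice to obtain `logits_a` and `logits_b` under different dropout masks, while the bidirectional (target-to-source) instances supply the paired representations for the contrastive term.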