The Impact of Syntactic and Semantic Proximity on Machine Translation with Back-Translation

Published: 27 Sept 2024, Last Modified: 27 Sept 2024. Accepted by TMLR. License: CC BY 4.0
Abstract: Unsupervised on-the-fly back-translation, in conjunction with multilingual pretraining, is the dominant method for unsupervised neural machine translation. Theoretically, however, the method should not work in general. We therefore conduct controlled experiments with artificial languages to determine what properties of languages make back-translation an effective training method, covering lexical, syntactic, and semantic properties. We find, contrary to popular belief, that (i) parallel word frequency distributions, (ii) partially shared vocabulary, and (iii) similar syntactic structure across languages are not sufficient to explain the success of back-translation. We show, however, that even a crude semantic signal (similar lexical fields across languages) does improve the alignment of two languages through back-translation. We conjecture that rich semantic dependencies, parallel across languages, are at the root of the success of unsupervised methods based on back-translation. Overall, the success of unsupervised machine translation was far from being analytically guaranteed. Instead, it is further evidence that the languages of the world share deep similarities, and we hope to show how to identify which of these similarities can serve the development of unsupervised, cross-linguistic tools.
Submission Length: Long submission (more than 12 pages of main content)
Changes Since Last Submission: 1) We added a discussion subsection (10.4) on the lexical field experiment, covering the assumption of a unique lexical field per sentence as well as the other methods we tried for introducing semantic dependencies. 2) We added examples to better explain the experiment of Section 10. 3) We added mapping examples between our artificial languages and natural languages (4.1). 4) We discussed BLEU alternatives (4.4). 5) We addressed some details about the experimental setup: number of sentences (4.2), vocabulary size (4.1), cross-lingual embeddings (4.2).
Assigned Action Editor: ~Alessandro_Sordoni1
Submission Number: 2376