Keywords: Theory, unsupervised learning, machine translation
TL;DR: An information-theoretic framework for unsupervised translation, paired with sample-complexity upper bounds and a probabilistic model of language.
Abstract: Unsupervised translation refers to the challenging task of translating between two languages without parallel translations, i.e., from two separate monolingual corpora without a Rosetta stone. We propose an information-theoretic framework for unsupervised translation that models the case where the source language is that of highly intelligent animals, such as whales, and the target language is a human language, such as English. In particular, the source data may be limited in quantity, the source and target languages may be quite different in nature, and few assumptions are made about the syntax of the source language. We apply our theory to a stylized setting of tree-based languages. Our analysis suggests that the amount of source data required for unsupervised translation is not significantly more than that required for supervised translation. Our analysis is purely information-theoretic; issues of algorithmic efficiency are left for future work. We are motivated by an ambitious initiative to translate whale communication using modern machine translation techniques. The recordings of whale communication being collected have no parallel human-language data.
In-person Presentation: yes