Processing Comparable Corpora With Bilingual Suffix Trees

Dragos Stefan Munteanu, Daniel Marcu

2002 (modified: 16 Jul 2019) | EMNLP 2002 | Readers: Everyone

Abstract: We introduce Bilingual Suffix Trees (BSTs), a data structure suited to exploiting comparable corpora. We discuss algorithms that use BSTs to create parallel corpora and to learn translations of unseen words from comparable corpora. Starting with a small bilingual dictionary derived automatically from a corpus of 5,000 parallel sentences, we automatically extracted a corpus of 33,926 parallel phrases of size greater than 3 and learned 9 new word translations from a comparable corpus of 1.3M words (100,000 sentences).
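
The abstract does not spell out how a BST is built or traversed, so the following is only a rough sketch of the general idea under stated assumptions, not the authors' algorithm. It assumes a word-level suffix trie (uncompressed, unlike a true suffix tree) over the source-language sentences, and it assumes that a target-language word "matches" a trie edge when the seed dictionary lists it as a translation of the source word on that edge. The names `build_suffix_trie`, `longest_match`, `extract_phrase_pairs`, and the `min_len` parameter are all hypothetical; `min_len=4` mirrors the "size greater than 3" threshold mentioned above. The greedy, first-match traversal is a simplification and does not backtrack.

```python
class Node:
    """One trie node; children are keyed by source-language word."""

    def __init__(self):
        self.children = {}


def build_suffix_trie(sentences):
    """Insert every word-level suffix of every source sentence into a trie."""
    root = Node()
    for sentence in sentences:
        words = sentence.split()
        for start in range(len(words)):
            node = root
            for word in words[start:]:
                node = node.children.setdefault(word, Node())
    return root


def longest_match(root, target_words, dictionary):
    """Walk the trie with target words, following an edge whenever the
    dictionary (assumed: source word -> set of target words) lists the
    current target word as a translation of the edge's source word.
    Greedy: takes the first matching edge and never backtracks."""
    node = root
    path = []
    for target_word in target_words:
        step = None
        for source_word, child in node.children.items():
            if target_word in dictionary.get(source_word, ()):
                step = (source_word, child)
                break
        if step is None:
            break
        path.append((step[0], target_word))
        node = step[1]
    return path


def extract_phrase_pairs(source_sentences, target_sentences, dictionary, min_len=4):
    """Collect (source phrase, target phrase) pairs of at least min_len words."""
    root = build_suffix_trie(source_sentences)
    pairs = set()
    for sentence in target_sentences:
        words = sentence.split()
        for start in range(len(words)):
            path = longest_match(root, words[start:], dictionary)
            if len(path) >= min_len:
                source_phrase = " ".join(s for s, _ in path)
                target_phrase = " ".join(t for _, t in path)
                pairs.add((source_phrase, target_phrase))
    return pairs
```

Calling `extract_phrase_pairs` with two lists of sentences and a small seed dictionary returns a set of candidate phrase pairs. A real system would use a compressed suffix tree for memory efficiency and would need to handle reordering and dictionary gaps, which this sketch ignores.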
