Low-resource Bilingual Dialect Lexicon Induction with Large Language Models

Published: 20 Mar 2023, Last Modified: 18 Apr 2023 | NoDaLiDa 2023 | Readers: Everyone
Keywords: German dialects, large language models, bitext mining, bilingual lexicon induction
TL;DR: We apply large language models to low-resource German dialects to mine bitext and induce bilingual lexicons, and we conduct a human evaluation of the outputs. The results are promising. We release code and experimental data for further uptake.
Abstract: Bilingual word lexicons map words in one language to their synonyms in another language. Numerous papers have explored bilingual lexicon induction (BLI) in high-resource scenarios, framing a typical pipeline that consists of two steps: (i) unsupervised bitext mining and (ii) unsupervised word alignment. At the core of both steps are pre-trained large language models (LLMs). In this paper we present an analysis of the BLI pipeline for German and two of its dialects, Bavarian and Alemannic. This setup poses a number of unique challenges, which stem from the scarcity of resources, the relatedness of the languages, and the lack of standardized orthography in the dialects. We analyze the BLI outputs with respect to word frequency and pairwise edit distance. Finally, we release an evaluation dataset consisting of manual annotations for 1K bilingual word pairs labeled according to their semantic similarity.
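To make the edit-distance analysis mentioned in the abstract concrete, here is a minimal sketch of how a pairwise edit distance between a German word and a dialect word could be computed. It assumes plain Levenshtein distance (unit cost for insertion, deletion, and substitution); the abstract does not specify which variant or normalization the authors use. The function names and the example word pairs are illustrative and are not taken from the released data.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b with unit costs (assumed variant)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb, or match
            ))
        prev = curr
    return prev[-1]


def normalized_edit_distance(a: str, b: str) -> float:
    """Edit distance divided by the longer word's length (a hypothetical normalization)."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


if __name__ == "__main__":
    # Illustrative German / dialect pairs, not drawn from the paper's dataset.
    pairs = [("ich", "i"), ("nicht", "ned"), ("Haus", "Haus")]
    for german, dialect in pairs:
        print(german, dialect,
              edit_distance(german, dialect),
              round(normalized_edit_distance(german, dialect), 2))
```

Because German and its dialects are closely related, many correct pairs are expected to lie at a small edit distance, which is why such a measure is a natural axis for analyzing induced lexicon entries.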