Dialect Clustering with Character-Based Metrics: in Search of the Boundary of Language and Dialect

Published: 01 Jan 2020, Last Modified: 12 Mar 2025, LREC 2020, CC BY-SA 4.0
Abstract: We present a universal, character-based method for representing sentences, which makes it possible to calculate the distance between any two sentences. With a small alphabet, the character-based representation can function as a proxy for phonemes. As one of its main uses, we carry out dialect clustering: we cluster a corpus that mixes dialects and sub-languages into sub-groups and check whether these coincide with the conventional boundaries of dialects and sub-languages. Using data covering multiple Japanese dialects and multiple Slavic languages, we report how well each group clusters, as a partial response to the question of what separates languages from dialects.
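The abstract does not specify the representation, the distance, or the clustering algorithm, so the following is only a minimal sketch of the general idea under stated assumptions: sentences are represented as character bigram counts, compared with cosine distance, and grouped by average-linkage agglomerative clustering. All three choices, and the function names `char_ngrams`, `cosine_distance`, and `cluster`, are illustrative and are not taken from the paper. The input strings are made-up toy data, not dialect samples.

```python
from collections import Counter
from math import sqrt


def char_ngrams(sentence, n=2):
    """Count overlapping character n-grams (assumed representation)."""
    text = sentence.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # an empty vector shares nothing with any other
    return 1.0 - dot / (norm_a * norm_b)


def cluster(sentences, k, n=2):
    """Average-linkage agglomerative clustering into k groups of indices."""
    vectors = [char_ngrams(s, n) for s in sentences]
    groups = [[i] for i in range(len(sentences))]

    def linkage(g1, g2):
        # Mean pairwise distance between members of the two groups.
        total = sum(
            cosine_distance(vectors[i], vectors[j]) for i in g1 for j in g2
        )
        return total / (len(g1) * len(g2))

    while len(groups) > k:
        # Find and merge the closest pair of groups.
        _, x, y = min(
            (linkage(groups[x], groups[y]), x, y)
            for x in range(len(groups))
            for y in range(x + 1, len(groups))
        )
        groups[x] = groups[x] + groups[y]
        del groups[y]
    return [sorted(g) for g in groups]


# Toy strings only; each inner list holds indices into `toy`.
toy = ["aaaa bbbb", "aaab bbba", "xyxy zzzz", "xyxz zzzy"]
print(cluster(toy, k=2))
```

In an experiment of the kind the abstract describes, the resulting groups would then be compared against the conventional dialect or language labels of the sentences.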