Abstract: Quantifying linguistic diversity in multilingual data sets is important for improving the cross-linguistic coverage of NLP models. However, current linguistic diversity scores rely mostly on measures such as the number of languages in the sample, which say little about the structural properties of those languages. In this paper, we propose a score derived from the distribution of a text statistic (mean word length) as a linguistic attribute suitable for cross-linguistic comparison. We compare NLP data sets (UD, Bible100, mBERT, XTREME, XGLUE, XNLI, XCOPA, TyDiQA, XQuAD) to a new data set designed specifically to be typologically representative (WALS-SC). To do so, we apply a version of the Jaccard index ($J_{mm}$) suitable for comparing sets of measures. This diversity score can identify the types of languages that need to be included in multilingual data sets in order to reach broad linguistic coverage. We find, for example, that (poly)synthetic languages are missing in almost all data sets.
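
As an illustration only, one way to compare two sets of measures with a Jaccard-style score is the min-max (weighted) Jaccard similarity over binned values. The abstract does not define $J_{mm}$, so the sketch below is an assumption about its general shape, not the paper's definition; the function names, the bin width, and the binning scheme are hypothetical.

```python
from collections import Counter


def bin_counts(mean_word_lengths, bin_width=1.0):
    """Count how many languages fall into each mean-word-length bin.

    The bin width is an illustrative choice, not a value from the paper.
    """
    return Counter(int(x // bin_width) for x in mean_word_lengths)


def minmax_jaccard(sample_a, sample_b, bin_width=1.0):
    """Weighted Jaccard similarity: sum of bin-wise minima over sum of maxima.

    Each sample is a list with one mean word length per language.
    Returns a value in [0, 1]; 1 means identical binned distributions.
    """
    a = bin_counts(sample_a, bin_width)
    b = bin_counts(sample_b, bin_width)
    bins = set(a) | set(b)
    # Counter returns 0 for missing bins, so non-shared bins only add to the denominator.
    shared = sum(min(a[k], b[k]) for k in bins)
    total = sum(max(a[k], b[k]) for k in bins)
    return shared / total if total else 1.0
```

Under this reading, a data set scores low against a typologically representative reference when whole ranges of mean word length (for example, the long-word range typical of polysynthetic languages) are populated in the reference but empty in the data set.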