Abstract: Current guidance on selecting an optimal or appropriate subword vocabulary size is incomplete and confusing at best. Using a measure of subword-morpheme overlap, our analysis shows that one can find a "sweet spot" for a morphology-informed subword vocabulary size. This sweet spot varies somewhat with text complexity and the morphological characteristics of a language, but it is relatively constant with respect to corpus size.