Abstract: Semitic morphologically-rich languages (MRLs) are plagued by word-level ambiguity: in a typical text, many (and often most) of the words are homographs with multiple possible analyses. Previous research on MRLs has claimed that contextualized embeddings trained in the standard way over word-pieces do not sufficiently capture the internal structure of such highly ambiguous words. Taking Hebrew as a case study, we investigate the extent to which Hebrew homographs can be disambiguated using contextualized embeddings. We evaluate all existing models for contextualized Hebrew embeddings on 75 Hebrew homograph challenge sets. Our empirical results demonstrate that contemporary Hebrew contextualized embeddings outperform non-contextualized embeddings; they are most effective at disambiguating segmentation and morphological features, and less effective at pure sense disambiguation. We show that these embeddings are more effective when the number of word-piece splits is limited, and that they handle 2-way and 3-way ambiguities better than 4-way ambiguity. We further show that the embeddings are equally effective for homographs with balanced and with skewed sense distributions. Finally, we show that these embeddings are as effective in a few-shot setup as with extensive supervised training.
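The abstract does not spell out the evaluation protocol, so the following is only an illustrative sketch of the general approach it describes: extract the contextual embedding of an ambiguous surface form from a pretrained Hebrew model, mean-pool over its word-pieces, and fit a simple linear probe over labeled senses. The model choice (onlplab/alephbert-base), the homograph ספר (sefer "book" vs. safar "counted"), and the two toy sentences are assumptions made for illustration, not material from the paper's challenge sets.

```python
# Minimal sketch (not the paper's exact protocol) of probing a Hebrew
# contextualized model for homograph disambiguation: embed the ambiguous
# surface form in context, then fit a linear classifier over labeled senses.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

MODEL = "onlplab/alephbert-base"  # assumed encoder; any Hebrew BERT works
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL)
model.eval()

def embed_word(sentence: str, word: str) -> torch.Tensor:
    """Mean-pool the contextual vectors of the word-pieces covering `word`.

    Mean-pooling is one simple way to handle word-piece splits; the paper
    notes that disambiguation degrades as the number of splits grows.
    """
    enc = tokenizer(sentence, return_tensors="pt", return_offsets_mapping=True)
    offsets = enc.pop("offset_mapping")[0].tolist()
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]  # (seq_len, dim)
    start = sentence.find(word)  # simplified: first occurrence, no boundary check
    end = start + len(word)
    piece_idx = [i for i, (s, e) in enumerate(offsets)
                 if s >= start and e <= end and e > s]  # skips special tokens
    return hidden[piece_idx].mean(dim=0)

# Hypothetical labeled contexts for one homograph; a real challenge set
# would hold many sentences per sense for each of the 75 homographs.
train = [
    ("קראתי ספר מעניין", "ספר", 0),  # sefer: "book" (noun)
    ("הוא ספר את הכסף", "ספר", 1),  # safar: "(he) counted" (verb)
]
X = torch.stack([embed_word(s, w) for s, w, _ in train]).numpy()
y = [label for *_, label in train]
probe = LogisticRegression(max_iter=1000).fit(X, y)
```

A linear probe is a deliberately weak classifier: if it separates the senses from the frozen embeddings alone, the disambiguating signal must already be present in the representations, which is the question the paper's evaluation is asking.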
Paper Type: short