Abstract: Several studies have explored the advantages of multilingual pre-trained models (e.g., multilingual BERT) in capturing shared linguistic knowledge, but their limitations have received comparatively little attention. In this paper, we investigate the representation degeneration problem and outlier dimensions in the multilingual contextual word representations (CWRs) of BERT. We show that although mBERT exhibits no outlier dimensions in its representations, its multilingual embedding space is highly anisotropic. Furthermore, our experimental results demonstrate that, as with their monolingual counterparts, increasing the isotropy of multilingual embedding spaces can significantly improve their representation power and downstream performance. Our analysis indicates that, although the degenerated directions vary across languages, they encode similar linguistic knowledge, suggesting a shared linguistic space among languages.
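To make the abstract's two key operations concrete, here is a minimal NumPy sketch of (a) a common anisotropy proxy, the mean cosine similarity between random pairs of embeddings, and (b) one standard way to increase isotropy by removing dominant ("degenerated") directions, in the style of all-but-the-top post-processing. This is an illustrative assumption about the kind of procedure involved, not the paper's exact method; the function names and the choice of 3 removed components are hypothetical.

```python
import numpy as np

def avg_cosine_similarity(emb, n_pairs=1000, seed=0):
    # Anisotropy proxy: mean cosine similarity over random embedding pairs.
    # A perfectly isotropic space gives values near 0; a highly
    # anisotropic space (all vectors in a narrow cone) gives values near 1.
    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(emb), n_pairs)
    j = rng.integers(0, len(emb), n_pairs)
    a, b = emb[i], emb[j]
    sims = np.sum(a * b, axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    )
    return float(sims.mean())

def increase_isotropy(emb, n_components=3):
    # All-but-the-top-style post-processing (hypothetical parameterization):
    # mean-center the embeddings, then project out the top principal
    # components, which capture the dominant degenerated directions.
    centered = emb - emb.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    top = vt[:n_components]            # dominant directions, shape (k, dim)
    return centered - centered @ top.T @ top
```

For instance, a batch of Gaussian vectors shifted by a large common offset is highly anisotropic (mean pairwise cosine near 1); after centering and removing the top components, the cosine measure drops toward 0.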