Abstract: Extracting information from large corpora of unstructured text with computational methods is a persistent challenge. Tshitoyan et al. (2019) demonstrated that unsupervised word embeddings produced by a static language model could be used to uncover 'latent knowledge' within a materials science corpus. The rise of contextualized, massively pre-trained language models such as BERT has seen static models surpassed on most NLP tasks. Nevertheless, owing to inherent differences in architecture and usage, BERT requires adaptation for knowledge mining. This study tests the suitability of BERT-derived word embeddings for knowledge mining. It uses a variation of the approach described by Bommasani et al. (2020) for creating static-equivalent vectors from multiple contextualized word representations. The study is conducted on a biomedical corpus with a biomedical BERT variant and is validated using domain-specific intrinsic benchmarks. Novel, layer-wise BERT performance characteristics are demonstrated. A key finding is that layer-wise intrinsic performance differs for nouns and verbs. Performance also varies according to whether a word of interest belongs to BERT's native vocabulary or must be built from sub-word representations: vocabulary-native representations perform best when extracted from earlier layers, while representations requiring multiple sub-word tokens perform best when extracted from the middle-to-later layers.
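As a rough illustration of the static-equivalent derivation described above, the sketch below pools BERT hidden states over a word's sub-word pieces and then averages over the word's occurrences in a set of context sentences, for a chosen layer. This is a minimal sketch assuming the Hugging Face transformers library and PyTorch; the model name, the mean-pooling choices, and the helper functions are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch: derive a static-equivalent, layer-wise vector for a word by
# mean-pooling its sub-word pieces and averaging over its occurrences in context.
# Assumptions: Hugging Face `transformers` + PyTorch; model name and pooling
# choices are illustrative, not the paper's exact setup.
import torch
from transformers import AutoTokenizer, AutoModel

MODEL_NAME = "bert-base-uncased"  # a biomedical BERT variant would be substituted here
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()


def find_span(sentence_ids, piece_ids):
    """Locate the word's sub-word piece IDs as a contiguous run in the sentence IDs."""
    n = len(piece_ids)
    for i in range(len(sentence_ids) - n + 1):
        if sentence_ids[i:i + n] == piece_ids:
            return i, i + n
    return None


def static_vector(word, sentences, layer):
    """Static-equivalent vector for `word` at hidden layer `layer`:
    mean-pool over sub-word pieces, then average over occurrences in `sentences`."""
    piece_ids = tokenizer(word, add_special_tokens=False)["input_ids"]
    occurrence_vectors = []
    for sentence in sentences:
        enc = tokenizer(sentence, return_tensors="pt")
        span = find_span(enc["input_ids"][0].tolist(), piece_ids)
        if span is None:
            continue  # the word does not occur verbatim in this sentence
        with torch.no_grad():
            hidden = model(**enc).hidden_states[layer][0]  # (seq_len, hidden_dim)
        occurrence_vectors.append(hidden[span[0]:span[1]].mean(dim=0))
    return torch.stack(occurrence_vectors).mean(dim=0) if occurrence_vectors else None


# Example: a layer-12 vector for "protein", averaged over two context sentences.
vec = static_vector("protein", ["The protein binds to the receptor.",
                                "Each protein was purified separately."], layer=12)
```

Repeating this extraction for each hidden layer over a benchmark word list is one way to obtain the kind of layer-wise intrinsic evaluation the abstract refers to.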
Paper Type: long