Enhanced Biomedical Knowledge Discovery From Unstructured Text Using Contextual Embeddings

Anonymous

05 Jun 2022 (modified: 05 May 2023) · ACL ARR 2022 June Blind Submission · Readers: Everyone
Abstract: Extracting knowledge from large, unstructured text corpora presents a challenge. Recently, authors have used unsupervised, static word embeddings to uncover "latent knowledge" contained within domain-specific scientific corpora. In that work, semantic-similarity measures between representations of concepts, objects, or entities were used to predict relationships, which were later verified experimentally. Static language models have since been surpassed at most downstream tasks by massively pre-trained, contextual language models such as BERT. It has been postulated that contextualized embeddings may yield word representations superior to static ones for knowledge-discovery purposes. To address this question, two biomedically trained BERT models (BioBERT, SciBERT) were used to encode $n = 500$, $1000$, or $5000$ sentences containing words of interest extracted from a biomedical corpus (Coronavirus Open Research Dataset). The $n$ contextual representations of each word of interest were then extracted and aggregated to yield static-equivalent word representations. These words belong to the vocabularies of intrinsic benchmarking tools for the biomedical domain (Bio-SimVerb and Bio-SimLex), which assess the quality of word representations using semantic-similarity and relatedness measures. These intrinsic benchmarks allow the feasibility of using contextualized word representations for knowledge discovery to be assessed: representations that better encode the described reality are expected to perform better, i.e. to align more closely with the judgments of domain experts. As postulated, BERT embeddings can outperform their static counterparts on both the verb and noun benchmarks; however, performance varies by model, and neither model alone outperforms static embeddings on both tasks. Moreover, distinct performance characteristics emerge when the task vocabulary is split between words native to BERT's vocabulary and words requiring sub-word decomposition.
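A minimal sketch of the aggregation step described above, under stated assumptions: the checkpoint name (`dmis-lab/biobert-base-cased-v1.1`), the mean pooling over sub-tokens and over the $n$ sentences, the sub-token matching heuristic, and the helper names `word_vector` and `cosine` are illustrative choices, not the authors' exact pipeline.

```python
# Hypothetical sketch: derive a static-equivalent vector for a target word by
# averaging its contextual BERT embeddings over sentences that contain it.
import torch
from transformers import AutoTokenizer, AutoModel

MODEL_NAME = "dmis-lab/biobert-base-cased-v1.1"  # assumed BioBERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

def word_vector(word: str, sentences: list[str]) -> torch.Tensor:
    """Average the contextual embeddings of `word` across `sentences`."""
    target_ids = tokenizer(word, add_special_tokens=False)["input_ids"]
    vectors = []
    for sentence in sentences:
        enc = tokenizer(sentence, return_tensors="pt", truncation=True)
        with torch.no_grad():
            hidden = model(**enc).last_hidden_state[0]  # (seq_len, hidden_dim)
        ids = enc["input_ids"][0].tolist()
        # Locate the word's sub-token span; words BERT splits into WordPieces
        # are pooled by averaging their sub-token vectors.
        for i in range(len(ids) - len(target_ids) + 1):
            if ids[i:i + len(target_ids)] == target_ids:
                vectors.append(hidden[i:i + len(target_ids)].mean(dim=0))
                break
    if not vectors:
        raise ValueError(f"'{word}' not found in any sentence")
    return torch.stack(vectors).mean(dim=0)

def cosine(u: torch.Tensor, v: torch.Tensor) -> float:
    """Cosine similarity between two aggregated word vectors."""
    return torch.nn.functional.cosine_similarity(u, v, dim=0).item()
```

For benchmarking, the cosine similarity of two aggregated vectors would be compared against the human similarity ratings in Bio-SimVerb or Bio-SimLex (e.g. via Spearman correlation over all word pairs); that evaluation loop is omitted here.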
Paper Type: long