Supplementing Domain Knowledge to BERT with Semi-structured Information of Documents

22 Sept 2022 (modified: 13 Feb 2023) · ICLR 2023 Conference Withdrawn Submission · Readers: Everyone
Keywords: Domain adaptation, Semi-structured information, BERT, Pre-trained language model, Biomedical question answering
TL;DR: A new domain adaptation method is proposed that emphasizes the importance of the semi-structured information of documents in helping BERT capture domain knowledge.
Abstract: Adapting BERT to an in-domain text corpus is a good way to boost its performance on domain-specific natural language processing (NLP) tasks. Common domain adaptation methods, however, can be deficient in capturing domain knowledge. Meanwhile, the context fragmentation inherent in Transformer-based models also hinders the acquisition of domain knowledge. Given the semi-structural characteristics of documents and their potential for alleviating these problems, we leverage the semi-structured information of documents to supplement BERT with domain knowledge. To this end, we propose a topic-based domain adaptation method that enhances the capture of domain knowledge at multiple levels of text granularity: a topic masked language model is designed at the paragraph level for pre-training, and a topic-subsection matching degree dataset is automatically constructed at the subsection level for intermediate fine-tuning. Experiments are conducted on three biomedical NLP tasks across five datasets, and the results highlight the importance of the previously overlooked semi-structured information for domain adaptation. Our method benefits BERT, RoBERTa, BioBERT, and PubMedBERT in nearly all cases and yields significant gains on the topic-related task, question answering, with an average accuracy improvement of 4.8 points.
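To make the first component of the abstract concrete, the sketch below illustrates what topic masked language modeling could look like: topic-related tokens, rather than random ones, are masked for the MLM objective over a paragraph. This is a minimal sketch under stated assumptions, not the authors' implementation; the function topic_mask, the topic_terms vocabulary, and the mask_probability parameter are all illustrative names introduced here.

```python
# Illustrative sketch of topic masked language modeling with HuggingFace
# transformers. Assumption: "topic masking" means masking only tokens that
# belong to a topic vocabulary (e.g., terms drawn from subsection headings).
import torch
from transformers import BertForMaskedLM, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

def topic_mask(text, topic_terms, mask_probability=0.8):
    """Mask only tokens whose surface form belongs to the topic vocabulary."""
    enc = tokenizer(text, return_tensors="pt", truncation=True)
    input_ids = enc["input_ids"]
    labels = input_ids.clone()
    tokens = tokenizer.convert_ids_to_tokens(input_ids[0].tolist())
    for i, tok in enumerate(tokens):
        # Strip WordPiece continuation markers before the vocabulary lookup.
        if tok.lstrip("#").lower() in topic_terms and torch.rand(1).item() < mask_probability:
            input_ids[0, i] = tokenizer.mask_token_id
    # Compute the MLM loss only on the masked positions.
    labels[input_ids != tokenizer.mask_token_id] = -100
    return enc, labels

paragraph = "Aspirin inhibits platelet aggregation and reduces fever."
topic_terms = {"aspirin", "platelet", "fever"}  # hypothetical topic vocabulary
batch, labels = topic_mask(paragraph, topic_terms)
loss = model(**batch, labels=labels).loss  # loss for one pre-training step
loss.backward()
```

The second component, the topic-subsection matching degree task, could analogously be framed as sentence-pair classification over (topic, subsection) pairs with BertForSequenceClassification, though the dataset construction details are specific to the paper and are not reproduced here.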
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Machine Learning for Sciences (eg biology, physics, health sciences, social sciences, climate/sustainability )