LinkBERT: Pretraining Language Models with Document Links

Michihiro Yasunaga; Jure Leskovec; Percy Liang

01 Jun 2022 (modified: 04 May 2025), ICML 2022 Workshop KRLM, Readers: Everyone

Keywords: language model, pretraining, knowledge, hyperlink

TL;DR: We propose LinkBERT, a new language model pretraining method that incorporates document link information (e.g. hyperlinks, citation links), and show its strength in acquiring multi-hop knowledge and performing multi-hop reasoning.

Abstract: Language model (LM) pretraining can learn a wide range of knowledge from text corpora, which helps downstream tasks. However, existing methods such as BERT model a single document at a time and do not capture dependencies or knowledge that span multiple documents. In this work, we propose LinkBERT, an LM pretraining method that leverages links between documents, e.g., hyperlinks. Given a text corpus, we view it as a graph of documents and create LM inputs by placing linked documents in the same context. We then pretrain the LM with two joint self-supervised objectives: masked language modeling and our new proposal, document relation prediction. We show that LinkBERT outperforms BERT on diverse downstream tasks across two domains: a general domain (pretrained on Wikipedia with hyperlinks) and a biomedical domain (pretrained on PubMed with citation links). LinkBERT is especially effective for multi-hop reasoning and few-shot QA (+5% absolute improvement on HotpotQA and TriviaQA), and the biomedical LinkBERT also sets new states of the art on various BioNLP tasks (+7% on BioASQ and USMLE). We release our pretrained models, LinkBERT and BioLinkBERT, as well as code and data.
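
The input-creation step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the released implementation: the function names (`make_pair`, `to_lm_input`), the toy corpus, the three relation labels (contiguous, random, linked), and the uniform sampling over them are all assumptions made for the example. Each pair would then feed both masked language modeling and a classifier that predicts the relation label.

```python
import random

# Relation labels for a segment pair (names and ids are illustrative).
CONTIGUOUS, RANDOM, LINKED = 0, 1, 2


def make_pair(corpus, links, doc_id, rng):
    """Return (segment_a, segment_b, label) for one anchor document.

    corpus: dict of doc id -> list of text segments (two or more per doc)
    links:  dict of doc id -> list of linked doc ids (e.g. hyperlink targets)
    """
    segments = corpus[doc_id]
    i = rng.randrange(len(segments) - 1)  # leave room for a following segment
    linked = links.get(doc_id, [])
    unrelated = [d for d in corpus if d != doc_id and d not in linked]

    # Only offer a relation type if the corpus can actually supply it.
    options = [CONTIGUOUS]
    if unrelated:
        options.append(RANDOM)
    if linked:
        options.append(LINKED)
    label = rng.choice(options)  # uniform choice is an assumption of this sketch

    if label == CONTIGUOUS:
        seg_b = segments[i + 1]                            # next segment, same document
    elif label == LINKED:
        seg_b = rng.choice(corpus[rng.choice(linked)])     # segment from a linked document
    else:
        seg_b = rng.choice(corpus[rng.choice(unrelated)])  # segment from an unlinked document
    return segments[i], seg_b, label


def to_lm_input(seg_a, seg_b):
    """Join two segments in the BERT-style two-segment layout."""
    return f"[CLS] {seg_a} [SEP] {seg_b} [SEP]"


# Hypothetical three-document corpus with a single hyperlink.
corpus = {
    "tidal_basin": ["The Tidal Basin hosts a festival.", "It celebrates cherry trees."],
    "cherry_tree": ["Cherry trees came from Japan.", "They bloom in spring."],
    "granite": ["Granite is an igneous rock.", "It is rich in quartz."],
}
links = {"tidal_basin": ["cherry_tree"]}

seg_a, seg_b, label = make_pair(corpus, links, "tidal_basin", random.Random(0))
print(label, to_lm_input(seg_a, seg_b))
```

Placing a segment from a linked document next to the anchor segment is what lets a single context window carry knowledge that spans two documents, which a single-document sampler would never produce.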

Community Implementations: [3 code implementations](https://www.catalyzex.com/paper/linkbert-pretraining-language-models-with/code)
