Abstract: Language models (LMs) pretrained on large corpora of text from the web have been observed to contain large amounts of various types of knowledge about the world. This observation has led to a new and exciting paradigm in knowledge graph construction where, instead of manual curation or text mining, one extracts knowledge from the parameters of an LM. Recently, it has been shown that finetuning LMs on a set of factual knowledge makes them produce better answers to queries from a different set, thus making finetuned LMs a good candidate for knowledge extraction and, consequently, knowledge graph construction. In this paper, we analyze finetuned LMs for factual knowledge extraction. We show that along with its previously known positive effects, finetuning also leads to a (potentially harmful) phenomenon which we call Frequency Shock, where at test time the model over-predicts rare entities that appear in the training set and under-predicts common entities that do not appear in the training set enough times. We show that Frequency Shock leads to a degradation in the predictions of the model and that, beyond a point, the harm from Frequency Shock can even outweigh the positive effects of finetuning, making finetuning harmful overall. We then consider two solutions to remedy the identified negative effect: (1) model mixing and (2) mixture finetuning with the LM's pre-training task. The two solutions combined lead to significant improvements compared to vanilla finetuning.
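The model-mixing remedy mentioned in the abstract can be illustrated with a minimal sketch: interpolate the next-token distributions of the finetuned and pretrained models, so the pretrained model's prior restores probability mass to common entities that the finetuned model under-predicts. The function name `mix_predictions`, the weight `alpha`, and the toy distributions below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def mix_predictions(p_finetuned, p_pretrained, alpha=0.5):
    """Weighted average of two next-token distributions.

    alpha weights the finetuned model; the pretrained model's
    distribution counteracts frequency shock by keeping common
    entities (rare in the finetuning set) from being suppressed.
    """
    p_finetuned = np.asarray(p_finetuned, dtype=float)
    p_pretrained = np.asarray(p_pretrained, dtype=float)
    mixed = alpha * p_finetuned + (1.0 - alpha) * p_pretrained
    return mixed / mixed.sum()  # renormalize for numerical safety

# Toy vocabulary of three candidate entities:
p_ft = [0.70, 0.20, 0.10]  # finetuned model over-predicts a rare entity
p_pt = [0.10, 0.30, 0.60]  # pretrained model favors the common entity
print(mix_predictions(p_ft, p_pt, alpha=0.5))  # -> [0.4  0.25 0.35]
```

With an equal mixing weight, the common entity regains probability mass relative to the finetuned model alone; in practice `alpha` would be tuned on held-out queries.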
Submission Length: Long submission (more than 12 pages of main content)
Changes Since Last Submission: Thanks to the constructive comments from the reviewers, we made a number of changes in our revision, outlined below (all the main updates are highlighted in blue for the convenience of the reviewers):
* Improved the structure of the paper by moving content that interrupted the flow of the paper into a separate discussion section.
* Centered the paper around "frequency shock" and made "range shift" a sub-effect.
* Added a section in the appendix that explains the different models in detail.
* Added Figure 3(b) for the effect of finetuning on all three datasets.
* Explained in more detail how/why model mixing and mixture training help alleviate frequency shock.
* Expanded on the contribution of Wallat et al.
* Made the knowledge extraction recipe more aligned with the expected amount of frequency mismatch.
* Added a high-level summary of the main findings of the paper in the discussion section.
* Added a paragraph connecting our contribution to bias amplification in the discussion section.
* Added a paragraph connecting our contribution to the domain adaptation literature.
* Fixed typos and citation formatting issues.
* Removed the knowledge extraction recipe subsection.
With the above changes, we believe our revised version has substantially improved in quality, and we would like to thank the reviewers again for their valuable feedback.
Assigned Action Editor: ~Tao_Qin1
Submission Number: 805