Fine-tuning Language Models for Factuality

Published: 28 Oct 2023, Last Modified: 26 Nov 2023. Instruction Workshop @ NeurIPS 2023.
Keywords: factuality, hallucination, language model, dpo
TL;DR: We show that language models can be fine-tuned to significantly reduce their rate of making factual errors, without needing human labels.
Abstract: The fluency and creativity of large pre-trained language models (LLMs) have led to their widespread use, sometimes even as a replacement for traditional search engines. However, language models are prone to making convincing but factually inaccurate claims, often referred to as 'hallucinations', which can harmfully perpetuate myths and misconceptions. Further, manual fact-checking of model responses is a time-consuming process, making human factuality labels expensive to acquire. In this work, we leverage two key recent innovations in NLP to fine-tune language models to be more factual without human labeling, targeting more open-ended generation settings than past work. First, several recent works have proposed methods for scoring the factuality of open-ended text, derived either from consistency with an external knowledge base or simply from a large model's confidence scores. Second, the Direct Preference Optimization algorithm enables straightforward fine-tuning of language models on objectives other than supervised imitation, using a preference ranking over possible model responses. We show that learning from preference rankings generated by either automated criterion significantly improves the factuality of Llama-2 (percent of generated claims that are correct) on held-out topics compared with existing RLHF procedures or decoding strategies targeted at factuality, showing over 50% and 20-30% error reduction for biographies and medical questions respectively.
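To make the recipe the abstract describes more concrete, below is a minimal, hypothetical sketch (not the authors' released code) of its two ingredients: ranking sampled responses by an automated factuality score and fine-tuning with the standard DPO objective on the resulting chosen/rejected pairs. The `build_preference_pairs` and `dpo_loss` helpers, the toy responses, the scores, the log-probabilities, and the `beta` value are all illustrative assumptions.

```python
# Illustrative sketch: factuality-scored preference pairs + DPO loss.
# Assumes PyTorch; all data and helper names below are hypothetical.
import torch
import torch.nn.functional as F


def build_preference_pairs(responses, factuality_scores):
    """Pair the most factual response (chosen) with the least factual one (rejected)."""
    ranked = sorted(zip(responses, factuality_scores), key=lambda x: x[1], reverse=True)
    chosen, rejected = ranked[0][0], ranked[-1][0]
    return chosen, rejected


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO objective: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()


# Toy usage with made-up scores and summed sequence log-probabilities.
responses = ["Curie won two Nobel Prizes.", "Curie won three Nobel Prizes."]
scores = [0.9, 0.2]  # e.g., fraction of claims judged consistent with a knowledge base
chosen, rejected = build_preference_pairs(responses, scores)

policy_logps = torch.tensor([-12.3, -11.8])  # log-probs under the model being fine-tuned
ref_logps = torch.tensor([-12.0, -11.5])     # log-probs under the frozen reference model
loss = dpo_loss(policy_logps[0], policy_logps[1], ref_logps[0], ref_logps[1])
print(loss.item())
```

In practice the scores would come from an automated factuality metric rather than hand-assigned numbers, and the loss would be backpropagated through the policy model's log-probabilities; this toy example only shows the shape of the computation.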
Submission Number: 41