Abstract: Factuality evaluation aims to detect factual errors produced by language models (LMs) and hence guide the development of more factual models.
Towards this goal, we train a factuality evaluator, FenCE, that provides LM generators with claim-level factuality feedback.
In particular, we train FenCE to (1) generate textual critiques along with scores and (2) make claim-level judgments based on diverse source documents obtained with various tools, via data augmentation on a combination of public judgment datasets.
We then present a framework that leverages FenCE to improve the factuality of LM generators by constructing training data.
Specifically, we generate a set of candidate responses, ask FenCE to revise and score each response without introducing lesser-known facts, and train the generator by preferring the highly scored revised responses (see the sketch below).
Experiments show that our data augmentation methods improve the evaluator's accuracy by 2.9% on LLM-AggreFact.
With FenCE, we improve Llama3-8B-chat's factuality rate by 14.45% on FActScore, outperforming state-of-the-art factuality finetuning methods by 6.96%.
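The loop described above can be read as a simple preference-data construction procedure. Below is a minimal, hedged sketch of that procedure, assuming hypothetical `generator`, `fence_revise`, and `fence_score` callables; these names and the pairing strategy are illustrative assumptions, not the paper's actual implementation.

```python
def build_preference_pairs(prompts, generator, fence_revise, fence_score,
                           num_candidates=4):
    """Sketch (assumed interfaces): sample candidate responses, have the
    evaluator revise and score them, and keep (chosen, rejected) pairs
    for preference finetuning of the generator."""
    pairs = []
    for prompt in prompts:
        # 1. Sample a set of candidate responses from the current generator.
        candidates = [generator(prompt) for _ in range(num_candidates)]

        # 2. Ask the evaluator to revise each candidate (without introducing
        #    lesser-known facts) and assign it a factuality score.
        revised = [fence_revise(prompt, c) for c in candidates]
        scored = [(r, fence_score(prompt, r)) for r in revised]

        # 3. Prefer the highest-scored revision over the lowest-scored one.
        scored.sort(key=lambda x: x[1], reverse=True)
        if scored[0][1] > scored[-1][1]:
            pairs.append({"prompt": prompt,
                          "chosen": scored[0][0],
                          "rejected": scored[-1][0]})
    return pairs
```

The resulting pairs would then be used for preference-based training of the generator (e.g., DPO-style finetuning), which is one plausible reading of "train the generator by preferring highly scored revised responses."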
Paper Type: Long
Research Area: Generation
Research Area Keywords: automatic evaluation
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Publicly available software and/or pre-trained models, Data resources
Languages Studied: English
Submission Number: 6547