PrefExpert: Preference-Aligned Parameter Editing for Mitigating Hallucinations in Large Language Models

ACL ARR 2025 February Submission 4213 Authors

15 Feb 2025 (modified: 09 May 2025) · ACL ARR 2025 February Submission · CC BY 4.0
Abstract: Large language models (LLMs) have achieved remarkable breakthroughs and are increasingly applied across many domains. However, their tendency to generate inaccurate or fabricated information, known as hallucination, remains a significant challenge that undermines their reliability. In this paper, we propose a novel preference-aligned parameter editing paradigm that constructs a PrefExpert to steer LLM behavior toward improved factual accuracy and truthfulness. Specifically, our approach first fine-tunes the backbone model on factual and hallucinated data, respectively, yielding expert and anti-expert models. We then perform parameter editing based on preference alignment, integrating the fine-tuned expert and anti-expert models through preference optimization. In particular, the learnable preference parameters are optimized with the proposed implicit reward model. To the best of our knowledge, this is $\textit{the first work}$ to conceptualize a preference expert for handling hallucinations. Extensive experiments on factuality, truthfulness, and toxicity benchmarks demonstrate that PrefExpert significantly outperforms existing parameter editing methods, reducing the toxicity ratio to 2.0\% and 3.5\%.
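To make the described paradigm concrete, below is a minimal sketch of how expert/anti-expert parameter editing with learnable preference weights and a DPO-style implicit reward could look. This is an illustration only: the abstract does not specify the exact editing rule, so the task-vector-style combination, the scalar weights `alpha`/`beta`, the temperature `tau`, and the toy linear "model" are all assumptions, not the authors' implementation.

```python
# Sketch (assumptions): edited = base + alpha*(expert - base) - beta*(anti - base),
# with alpha, beta optimized by a DPO-style implicit reward over factual (preferred)
# vs. hallucinated (dispreferred) responses. Not the paper's exact formulation.
import torch
import torch.nn.functional as F

def edit_parameters(base, expert, anti_expert, alpha, beta):
    """Combine expert/anti-expert deltas into the base weights."""
    return {
        name: w + alpha * (expert[name] - w) - beta * (anti_expert[name] - w)
        for name, w in base.items()
    }

def implicit_reward_loss(lp_pref, lp_dispref, lp_pref_ref, lp_dispref_ref, tau=0.1):
    """Implicit reward = log-prob ratio of edited vs. frozen reference model."""
    reward_pref = lp_pref - lp_pref_ref
    reward_dispref = lp_dispref - lp_dispref_ref
    return -F.logsigmoid((reward_pref - reward_dispref) / tau).mean()

# Toy stand-in "model": a single linear layer whose weight matrix is edited.
torch.manual_seed(0)
base = {"w": torch.randn(8, 4)}
expert = {"w": base["w"] + 0.1 * torch.randn(8, 4)}       # fine-tuned on factual data
anti_expert = {"w": base["w"] + 0.1 * torch.randn(8, 4)}  # fine-tuned on hallucinated data

x = torch.randn(16, 8)                # toy inputs
pref = torch.randint(0, 4, (16,))     # ids of preferred (factual) responses
dispref = torch.randint(0, 4, (16,))  # ids of dispreferred (hallucinated) responses

def logp(weights, tokens):
    logits = x @ weights["w"]
    return F.log_softmax(logits, dim=-1).gather(1, tokens[:, None]).squeeze(1)

ref_pref = logp(base, pref).detach()
ref_dispref = logp(base, dispref).detach()

# Learnable preference parameters, optimized by the implicit reward objective.
alpha = torch.tensor(0.5, requires_grad=True)
beta = torch.tensor(0.5, requires_grad=True)
optimizer = torch.optim.Adam([alpha, beta], lr=1e-2)

for _ in range(100):
    edited = edit_parameters(base, expert, anti_expert, alpha, beta)
    loss = implicit_reward_loss(logp(edited, pref), logp(edited, dispref),
                                ref_pref, ref_dispref)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In this reading, the expert delta pushes the edited model toward factual behavior while the anti-expert delta is subtracted to suppress hallucination-inducing directions; the preference weights are learned rather than hand-tuned.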
Paper Type: Long
Research Area: Dialogue and Interactive Systems
Research Area Keywords: bias/toxicity, factuality
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 4213