Resolving Knowledge Conflicts in Domain-specific Data Selection: A Case Study on Medical Instruction-tuning
Abstract: Domain-specific Instruction-tuning (IT) has become the de facto standard for improving the performance of large language models (LLMs) in specialized applications, e.g., medical question answering. Since IT datasets may contain redundant or low-quality data, data selection (DS) is usually required to maximize data efficiency. Despite their success in the general domain, current DS methods often struggle to select the desired data for domain-specific IT. One of the main reasons is that they neglect the impact of knowledge conflicts, i.e., discrepancies between LLMs' pretrained knowledge and the context knowledge of the IT data, which can damage LLMs' prior abilities and lead to hallucination. To this end, we propose a simple yet effective Knowledge-aware Data Selection (KDS) framework to select the domain-specific IT data that meets LLMs' actual needs. The core of KDS is to leverage two knowledge-aware metrics that quantitatively measure knowledge conflicts from two aspects: context-memory knowledge alignment and intra-memory knowledge consistency. Taking medical IT as the testbed, we conduct extensive experiments and empirically show that KDS surpasses the other baselines and brings significant and consistent performance gains across all LLMs. More encouragingly, KDS effectively improves model generalization and alleviates hallucination.
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: data-efficient training, data selection, NLP in resource-constrained settings
Contribution Types: NLP engineering experiment, Approaches low compute settings-efficiency, Data analysis
Languages Studied: English
Submission Number: 3752
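
The abstract describes KDS only at a high level and does not give the metric definitions. The following is a minimal, hypothetical Python sketch of how such a knowledge-aware selection step could be wired up, assuming IT examples are (question, reference_answer) pairs and two placeholder model callables: `answer_logprob` (closed-book log-probability of an answer given the question) and `sample_answers` (k sampled closed-book answers). These names, and the weighted-sum combination of the two scores, are illustrative assumptions rather than the paper's actual formulation.

```python
"""Illustrative sketch of knowledge-aware data selection (not the paper's exact method).

Assumptions: `answer_logprob` and `sample_answers` are hypothetical wrappers
around an LLM; higher scores mean less conflict with the model's parametric knowledge.
"""
from typing import Callable, List, Tuple


def context_memory_alignment(question: str,
                             reference_answer: str,
                             answer_logprob: Callable[[str, str], float]) -> float:
    """Higher when the IT example's answer agrees with the model's pretrained knowledge."""
    return answer_logprob(question, reference_answer)


def intra_memory_consistency(question: str,
                             sample_answers: Callable[[str, int], List[str]],
                             k: int = 5) -> float:
    """Agreement rate among k sampled closed-book answers (majority-vote share)."""
    samples = sample_answers(question, k)
    majority = max(set(samples), key=samples.count)
    return samples.count(majority) / len(samples)


def select_data(dataset: List[Tuple[str, str]],
                answer_logprob: Callable[[str, str], float],
                sample_answers: Callable[[str, int], List[str]],
                budget: int,
                alpha: float = 0.5) -> List[Tuple[str, str]]:
    """Rank IT examples by a weighted sum of the two metrics and keep the top `budget`."""
    scored = []
    for question, answer in dataset:
        score = (alpha * context_memory_alignment(question, answer, answer_logprob)
                 + (1 - alpha) * intra_memory_consistency(question, sample_answers))
        scored.append((score, (question, answer)))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [example for _, example in scored[:budget]]
```

Exact-match majority voting is the simplest possible consistency check; for free-form medical answers, a real implementation would more plausibly use a semantic-equivalence or entailment-based comparison.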