Knowledge-Centric Data Selection for Effective Domain Adaptation of Language Models

ICLR 2026 Conference Submission 19582 Authors

19 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: Knowledge-Centric Data Selection, Data Efficiency, Revenue Boundary, Informativeness, Redundancy Reduction, Supervised Fine-Tuning (SFT), Retrieval-Augmented Generation (RAG)
TL;DR: We introduce a knowledge-centric method to select compact, high-quality data for efficient domain adaptation, improving performance and reducing curation costs in SFT and RAG.
Abstract: Data filtering remains a fundamental challenge in training large language models (LLMs). Large-scale corpora often contain noisy or redundant samples that undermine training efficiency and limit model performance. Existing approaches typically rely on manual curation or heuristic model-based filtering, yet lack a principled and widely accepted quantitative criterion for deciding which data should be retained. To address this gap, we propose an entropy-based data filtering framework that quantitatively evaluates the informativeness and coverage of individual samples. Our method enables the systematic selection of high-value data, improving the efficiency of supervised fine-tuning (SFT) while also enhancing retrieval quality in retrieval-augmented generation (RAG). Together, these gains position entropy-driven filtering as a general strategy for improving both adaptation and retrieval in large-scale LLM pipelines.
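
The abstract does not spell out the entropy criterion, so the following is a minimal sketch of one plausible reading: score each sample by the mean Shannon entropy of a reference model's next-token distributions (a proxy for informativeness), then greedily keep high-entropy samples while skipping near-duplicates in embedding space (redundancy reduction). Every name and threshold here (sample_entropy, select_samples, sim_thresh) is an illustrative assumption, not the authors' method.

    import numpy as np

    def sample_entropy(token_probs: np.ndarray) -> float:
        # token_probs: (seq_len, vocab_size) next-token probabilities from a
        # reference LM; returns the mean Shannon entropy (in nats) over tokens.
        eps = 1e-12
        per_token = -(token_probs * np.log(token_probs + eps)).sum(axis=-1)
        return float(per_token.mean())

    def select_samples(samples, embeddings, entropies, k, sim_thresh=0.9):
        # Greedy selection: visit samples from most to least informative and
        # skip any sample whose embedding is too close to one already kept.
        order = np.argsort(entropies)[::-1]
        unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
        kept = []
        for i in order:
            if len(kept) == k:
                break
            if kept and (unit[kept] @ unit[i]).max() > sim_thresh:
                continue  # redundant with the selected set
            kept.append(int(i))
        return [samples[i] for i in kept]

    # Toy usage: 100 samples, 32 tokens each, vocabulary of 50.
    rng = np.random.default_rng(0)
    probs = rng.dirichlet(np.ones(50), size=(100, 32))
    entropies = np.array([sample_entropy(p) for p in probs])
    embeddings = rng.normal(size=(100, 16))
    subset = select_samples(list(range(100)), embeddings, entropies, k=10)

In a real pipeline the probabilities would come from a domain reference model and the coverage term might be a clustering or submodular objective; the greedy cosine filter above is only the simplest stand-in.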
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 19582