Task-Aware Data Selection via Proxy-Label Enhanced Distribution Matching for LLM Finetuning

ICLR 2026 Conference Submission 11624 Authors

18 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: Data Selection; Task-specific model fine-tuning
TL;DR: We propose a proxy-label enhanced distribution matching method for task-specific model fine-tuning that shows superior performance on multiple benchmarks.
Abstract: Task-specific fine-tuning of foundation models is critically dependent on the quality and relevance of the instruction data. While prevailing data selection methods rely exclusively on instruction instances X to approximate the target distribution, we argue that selection should align with the joint distribution of instructions and task-specific labels (X, Y). However, task-specific labels Y are typically unavailable in practice. To address this, we reformulate the task-specific data selection problem and present a novel pipeline that leverages the reasoning capabilities of large language models (LLMs) to infer proxy labels, thereby facilitating joint distribution alignment. Our approach begins by propagating proxy labels from a small target set to a large, unlabeled source corpus. A two-stage filtering process then removes instances with label noise and refines the subset through distribution alignment. This strategy produces more semantically meaningful and task-aware selections than conventional similarity measures based on X alone. Experimental results show that fine-tuning on a subset of only 10K samples, selected from a pool of 300K, achieves performance competitive with or superior to state-of-the-art methods.
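For intuition, the sketch below illustrates the kind of pipeline the abstract describes. It is a minimal illustration under assumptions introduced here, not the authors' implementation: k-nearest-neighbor majority voting over sentence embeddings stands in for the LLM-based proxy-label inference, a neighbor-agreement ratio serves as the label-noise filter, and per-label quotas with centroid similarity act as a crude form of joint-distribution alignment. The function names (`propagate_proxy_labels`, `select_subset`), the `min_conf` threshold, and the embedding inputs are all hypothetical.

```python
import numpy as np

def propagate_proxy_labels(src_emb, tgt_emb, tgt_labels, k=5):
    """Assign each source instance the majority label of its k nearest labeled
    target instances (k-NN propagation standing in for the paper's LLM-based
    proxy-label inference)."""
    tgt_labels = np.asarray(tgt_labels)
    sims = src_emb @ tgt_emb.T              # cosine similarity, assuming L2-normalized embeddings
    topk = np.argsort(-sims, axis=1)[:, :k]
    proxy, conf = [], []
    for row in topk:
        vals, counts = np.unique(tgt_labels[row], return_counts=True)
        proxy.append(vals[np.argmax(counts)])
        conf.append(counts.max() / k)       # neighbor agreement as a crude noise score
    return np.array(proxy), np.array(conf)

def select_subset(src_emb, tgt_emb, tgt_labels, budget=10_000, min_conf=0.6):
    """Two-stage selection: (1) drop source instances whose proxy labels look
    noisy (low neighbor agreement); (2) fill per-label quotas mirroring the
    target label distribution, preferring points closest to each label centroid."""
    tgt_labels = np.asarray(tgt_labels)
    proxy, conf = propagate_proxy_labels(src_emb, tgt_emb, tgt_labels, k=5)
    keep = np.where(conf >= min_conf)[0]    # stage 1: label-noise filtering
    selected = []
    labels, counts = np.unique(tgt_labels, return_counts=True)
    for lab, cnt in zip(labels, counts):
        quota = int(round(budget * cnt / len(tgt_labels)))
        cand = keep[proxy[keep] == lab]
        centroid = tgt_emb[tgt_labels == lab].mean(axis=0)
        order = np.argsort(-(src_emb[cand] @ centroid))  # stage 2: distribution alignment
        selected.extend(cand[order[:quota]].tolist())
    return selected[:budget]
```

A realistic instantiation would replace the k-NN step with LLM-inferred proxy labels and use a proper distribution-matching objective over the joint (X, Y) representation, as described in the abstract.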
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 11624