Keywords: Large Language Models, Reasoning, Data Efficiency, Supervised Fine-Tuning, Collaborative Filtering
TL;DR: A data selection approach balancing difficulty and diversity for efficient fine-tuning of LLMs with minimal annotation effort.
Abstract: The performance of fine-tuned language models is heavily influenced by the quality and quantity of their fine-tuning data. While scaling laws suggest that larger models benefit from more data during pretraining, the Less-is-More hypothesis highlights that downstream fine-tuning often requires only a small but high-quality dataset to effectively elicit a model’s pretrained knowledge. However, identifying such premium data, particularly in terms of difficulty and diversity, typically relies on human expertise, and existing methods offer limited guidance for automatic selection from large unannotated corpora. This work presents a novel quantitative framework that formalizes the interplay between question difficulty and diversity, and introduces *Difficulty–Diversity Collaborative Filtering* (DDCF): an automated approach that tailors data selection to the unique characteristics of each language model via collaborative filtering. By leveraging a small seed dataset to predict correctness across a large unannotated corpus, our method reduces annotation cost by $100$–$200\times$, while maintaining downstream performance comparable to full-corpus fine-tuning.
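The correctness-prediction step described in the abstract can be illustrated with a generic collaborative-filtering sketch. The example below is a hypothetical illustration, not the paper's DDCF algorithm: it assumes a models-by-questions correctness matrix, uses synthetic data, and fits a low-rank factorization to a small observed "seed" subset by gradient descent; the rank, learning rate, and observation rate are arbitrary choices for the sketch.

```python
import numpy as np

# Hypothetical illustration (not the paper's exact DDCF method):
# collaborative filtering over a models-by-questions correctness matrix.
# Only a small annotated "seed" of entries is observed; matrix
# factorization predicts correctness on the remaining questions.
rng = np.random.default_rng(0)
n_models, n_questions, rank = 20, 50, 3

# Synthetic ground-truth correctness (1 = model answers the question correctly).
scores = rng.random((n_models, rank)) @ rng.random((n_questions, rank)).T
R = (scores > np.median(scores)).astype(float)

# Observe only a small annotated seed subset (~30% of entries).
mask = rng.random(R.shape) < 0.3

# Fit low-rank factors U, V to the observed entries by gradient descent.
U = rng.normal(scale=0.1, size=(n_models, rank))
V = rng.normal(scale=0.1, size=(n_questions, rank))
lr, reg = 0.05, 0.01
loss = lambda: float(((mask * (U @ V.T - R)) ** 2).sum())
init_loss = loss()
for _ in range(2000):
    err = mask * (U @ V.T - R)      # error restricted to observed entries
    U -= lr * (err @ V + reg * U)
    V -= lr * (err.T @ U + reg * V)
final_loss = loss()

# Predicted correctness for every (model, question) pair, including
# the unannotated ones; these predictions would then feed a
# difficulty/diversity-aware selection step.
pred = (U @ V.T) > 0.5
```

In this sketch the per-model difficulty of an unannotated question could be read off from its predicted correctness, which is the quantity the selection step balances against diversity.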
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 15954