Keywords: local learning, test-time fine-tuning, transductive learning, data selection, retrieval, active learning, transductive active learning, language modeling, uncertainty quantification
TL;DR: We develop SIFT, an effective data selection method for fine-tuning LLMs. We show that test-time fine-tuning with SIFT can significantly and robustly improve language modeling ability.
Abstract: Recent efforts in fine-tuning language models often rely on automatic data selection, commonly using Nearest Neighbor retrieval from large datasets. However, we theoretically show that this approach tends to select redundant data, which limits its effectiveness or even hurts performance. To address this, we introduce SIFT, a data selection algorithm designed to reduce uncertainty about the model's response to the prompt, unifying ideas from retrieval and active learning. SIFT accounts for redundant information and optimizes the overall information gain of the selected examples. Our evaluations, focusing on prompt-specific fine-tuning at test time, show that SIFT consistently outperforms Nearest Neighbor retrieval in language modeling on the Pile dataset, with minimal computational overhead. Whereas Nearest Neighbor retrieval typically fails in the presence of information duplication, SIFT is entirely robust to such cases.
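The core idea in the abstract, selecting examples for their overall information gain rather than raw similarity to the prompt, can be sketched with a greedy kernel-based rule: at each step, pick the example that most reduces the posterior variance of a regularized kernel model at the prompt embedding. This is an illustrative sketch under assumed cosine-similarity embeddings and a hypothetical regularizer `lam`, not the authors' implementation:

```python
import numpy as np

def sift_select(prompt_emb, data_embs, k, lam=0.1):
    """Greedy uncertainty-reduction sketch: repeatedly add the example
    that most shrinks the posterior variance at the prompt under a
    linear-kernel model with regularization `lam` (hypothetical value).
    A duplicate of an already-selected example shrinks variance little,
    so redundant data is naturally down-weighted."""
    def normalize(X):
        return X / np.linalg.norm(X, axis=-1, keepdims=True)

    x = normalize(prompt_emb[None, :])[0]   # prompt embedding
    X = normalize(data_embs)                # candidate embeddings
    selected = []
    for _ in range(k):
        best_i, best_var = None, np.inf
        for i in range(len(X)):
            if i in selected:
                continue
            S = X[selected + [i]]
            # regularized kernel matrix of the tentative selection
            K = S @ S.T + lam * np.eye(len(S))
            kx = S @ x
            # posterior variance of the prediction at the prompt
            var = 1.0 - kx @ np.linalg.solve(K, kx)
            if var < best_var:
                best_i, best_var = i, var
        selected.append(best_i)
    return selected
```

On a toy set containing an exact duplicate, plain Nearest Neighbor retrieval picks the duplicate pair, while the greedy rule above skips the duplicate in favor of a slightly less similar but informative example:

```python
prompt = np.array([1.0, 0.0])
data = np.array([[0.92, 0.4],    # closest to prompt
                 [0.92, 0.4],    # exact duplicate
                 [0.9, -0.44]])  # less similar, but non-redundant
sift_select(prompt, data, 2)     # selects indices 0 and 2, not the duplicate
```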
Submission Number: 44