Greedy Information Projection for LLM Data Selection

ICLR 2026 Conference Submission 16634 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: data selection, large language model, fine-tuning
TL;DR: Greedy Information Projection (GIP) uses a mutual-information objective with efficient greedy algorithms to pick small, high-value, diverse training subsets from general query signals, matching full-data performance at a fraction of the cost.
Abstract: We present Greedy Information Projection (GIP), a principled framework for choosing training examples for large language model fine-tuning. GIP casts selection as maximizing mutual information between a compact subset of examples and task-specific query signals, which may originate from LLM quality judgments, metadata, or other sources. Under a jointly Gaussian model of data and query embeddings, the objective has a closed form and naturally balances quality and diversity. We show that optimizing this score is equivalent to maximizing the projection of the query embedding matrix onto the span of the selected data, yielding a geometric explanation for the co-emergence of quality and diversity. Building on this view, we develop a fast greedy matching-pursuit procedure with efficient projection-based updates. On instruction-following and mathematical reasoning datasets, GIP selects compact subsets that match full-data fine-tuning while using only a small fraction of examples and compute, unifying quality-aware and diversity-aware selection for efficient fine-tuning.
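The abstract describes selection as maximizing the projection of the query embedding matrix onto the span of the selected data embeddings, optimized by a greedy matching-pursuit procedure with projection-based residual updates. The sketch below illustrates that general idea in NumPy under stated assumptions; the function name, signature, and the Gram-Schmidt bookkeeping are illustrative choices, not the authors' released code, and the paper's closed-form Gaussian objective is not reproduced here.

```python
import numpy as np

def gip_select(X, Q, k):
    """Illustrative greedy matching-pursuit selection (not the authors' code).

    X: (n, d) array of data-example embeddings.
    Q: (m, d) array of task query embeddings.
    Greedily picks k rows of X so that the span of the selected rows
    captures as much of Q's energy as possible, updating a residual
    by projecting out each newly selected direction.
    """
    R = Q.astype(float).copy()                          # query residual not yet explained
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)   # unit-norm candidate directions
    basis = []                                          # orthonormal basis of selected span
    selected = []
    for _ in range(k):
        # Score each candidate by how much residual energy its direction captures.
        scores = np.sum((R @ Xn.T) ** 2, axis=0)
        scores[selected] = -np.inf                      # never re-pick an example
        j = int(np.argmax(scores))
        selected.append(j)
        # Gram-Schmidt: orthogonalize the new direction against the current basis.
        v = Xn[j].copy()
        for b in basis:
            v -= (v @ b) * b
        norm = np.linalg.norm(v)
        if norm < 1e-12:
            continue                                    # direction already in the span
        v /= norm
        basis.append(v)
        # Projection-based update: remove the new direction from the residual.
        R -= np.outer(R @ v, v)
    return selected
```

Because each step only removes the component of the residual along one new orthonormal direction, the selected set naturally spreads across the directions the queries occupy, which is the geometric quality-plus-diversity behavior the abstract highlights.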
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 16634