TokenTune: Dual-Level Utility Estimation for Scalable Data Selection in Instruction Tuning

ICLR 2026 Conference Submission19719 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Data Selection, LLMs, Instruction Tuning
Abstract: Recent studies indicate that data quality matters more than quantity when fine-tuning large language models (LLMs). However, existing data selection methods face two key limitations. First, they lack an effective utility estimation function: sample-level methods score entire examples but ignore which tokens are actually useful, while token-level methods drop tokens with multiple valid answers and thus discard valuable learning signals. Second, these methods are inefficient because they require full-dataset inference to compute utilities, making them prohibitively expensive at scale. To address these challenges, we propose TokenTune, an efficient data selection framework for instruction tuning. The key idea of TokenTune is a dual-level utility function that operates at both the token and sample levels. At the token level, it identifies learnable tokens that still provide strong gradient signals and multi-answer tokens that preserve diversity under incomplete supervision. At the sample level, it derives a utility score directly from token signals, avoiding redundant full-dataset inference. To scale further, TokenTune employs a two-stage design. In the selection stage, a multi-armed bandit adaptively prioritizes informative clusters, from which high-utility samples are chosen using the sample-level score. In the training stage, the token-level utility guides gated optimization: learnable tokens strengthen supervision, while multi-answer tokens preserve diversity. Extensive experiments across 7 benchmarks show that TokenTune significantly outperforms state-of-the-art methods, improving average performance by +3.8% while using only 5% of the full training data and reducing overall training time by 8–10×.
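To make the abstract's pipeline concrete, here is a minimal sketch of how a dual-level utility plus bandit-driven cluster selection could be wired together. This is not the paper's actual implementation: all thresholds, weights, and function names (e.g., LEARNABLE_LOSS_MIN, ucb_select_cluster) are illustrative assumptions, and it presumes per-token losses and predictive entropies have already been computed by a reference model.

```python
import numpy as np

# Hypothetical thresholds; the paper does not specify these values here.
LEARNABLE_LOSS_MIN = 0.5    # tokens above this loss still carry gradient signal
MULTI_ANSWER_ENTROPY = 2.0  # high predictive entropy suggests multiple valid answers


def token_utilities(token_losses, token_entropies):
    """Assign a per-token weight (illustrative rule, not the paper's formula)."""
    losses = np.asarray(token_losses, dtype=float)
    entropies = np.asarray(token_entropies, dtype=float)
    learnable = losses > LEARNABLE_LOSS_MIN
    multi_answer = entropies > MULTI_ANSWER_ENTROPY
    # Learnable tokens get full weight; multi-answer tokens a softened weight,
    # so their diversity is preserved rather than dropped outright.
    weights = np.where(learnable, 1.0, 0.0)
    weights = np.where(multi_answer, 0.5, weights)
    return weights


def sample_utility(token_losses, token_entropies):
    """Aggregate token signals into a sample-level score (weighted mean loss).

    This reuses the token-level signals directly, so no separate
    full-dataset inference pass is needed for the sample score.
    """
    w = token_utilities(token_losses, token_entropies)
    losses = np.asarray(token_losses, dtype=float)
    return float((w * losses).sum() / max(w.sum(), 1.0))


def ucb_select_cluster(counts, reward_sums, t, c=1.0):
    """UCB1 over data clusters: favor clusters whose samples scored well so far."""
    counts = np.asarray(counts, dtype=float)
    means = np.asarray(reward_sums, dtype=float) / np.maximum(counts, 1.0)
    bonus = c * np.sqrt(np.log(max(t, 1)) / np.maximum(counts, 1.0))
    bonus[counts == 0] = np.inf  # ensure every cluster is tried at least once
    return int(np.argmax(means + bonus))


def gated_loss(token_losses, token_entropies):
    """Training-stage gating: reweight the per-token loss by token utility."""
    w = token_utilities(token_losses, token_entropies)
    return float((w * np.asarray(token_losses, dtype=float)).mean())
```

In this sketch, the selection stage would repeatedly call ucb_select_cluster, score a batch of samples from the chosen cluster with sample_utility, keep the top scorers, and feed the realized scores back as bandit rewards; the training stage would swap the standard uniform token loss for gated_loss. The 0.5 soft weight and UCB1 exploration constant are placeholders for whatever the paper actually uses.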
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 19719