DATA-FM: TOKEN-BUDGETED DATA FILTERING FOR INSTRUCTION TUNING

Published: 02 Mar 2026, Last Modified: 02 Mar 2026 · ICLR 2026 Workshop DATA-FM · CC BY 4.0
Keywords: token-budgeted data selection, instruction tuning (SFT), MinHash deduplication, embedding-based diversity sampling, LoRA/QLoRA fine-tuning.
TL;DR: DATA-FM fixes the SFT token budget (≈800k) to isolate data-selection effects, and shows on TinyLlama-1.1B-Chat (Alpaca+Dolly) that MinHash dedup and MinHash+embedding diversity perform on par with random (held-out PPL Δ<0.2%, toxicity ≈0).
Abstract: Data filtering is often assumed to improve instruction tuning, but practitioners rarely control for token budget—the most binding constraint in small-scale finetuning. We study a simple question: at a fixed training-token budget, do common filtering heuristics beat random selection? Using an 800,000-token cap over a mixed instruction dataset (Alpaca + Dolly), we compare (i) random selection, (ii) near-duplicate removal via MinHash, and (iii) MinHash + embedding-based diversity selection. We finetune TinyLlama/TinyLlama-1.1B-Chat-v1.0 with QLoRA/LoRA-style adapters and evaluate held-out perplexity plus a lightweight toxicity probe. In this regime, both deduplication variants match but do not improve on random selection: held-out PPL differs by < 0.2%, and random is slightly best. Our takeaway is deliberately modest: under tight token budgets and standard instruction corpora, “obvious” deduplication/diversity steps may be lower-impact than expected unless paired with stronger quality signals, larger budgets, or multi-seed evaluation.
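To make the setup concrete, the token-budgeted MinHash condition in the abstract can be sketched as a single greedy pass: keep an example only if it fits the remaining budget and its MinHash signature is not too similar to anything already kept. This is an illustrative stdlib-only sketch, not the paper's code; the shingle size, 64-permutation signature, whitespace token counting, and 0.8 duplicate threshold are all assumptions.

```python
import hashlib

def shingles(text, k=5):
    # Character k-shingles as a simple proxy for word n-grams.
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}

def minhash(sh, num_perm=64):
    # Stdlib-only MinHash: salt one hash function per "permutation"
    # and record the minimum hashed shingle for each salt.
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16) for s in sh)
        for seed in range(num_perm)
    ]

def jaccard_est(sig_a, sig_b):
    # Fraction of matching signature slots estimates Jaccard similarity.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def select_under_budget(examples, token_budget, dup_threshold=0.8):
    # Greedy pass: skip examples that exceed the remaining budget or
    # near-duplicate an already-kept example (hypothetical threshold).
    kept, sigs, used = [], [], 0
    for ex in examples:
        n_tok = len(ex.split())  # stand-in for real tokenizer counts
        if used + n_tok > token_budget:
            continue
        sig = minhash(shingles(ex))
        if any(jaccard_est(sig, s) >= dup_threshold for s in sigs):
            continue  # near-duplicate of something already kept
        kept.append(ex)
        sigs.append(sig)
        used += n_tok
    return kept, used
```

In the paper's actual setting the budget would be the ≈800k-token cap and the counts would come from the TinyLlama tokenizer; the diversity variant would add an embedding-distance criterion on top of this dedup pass.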
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 122