Keywords: Token-budgeted data selection, Instruction tuning (SFT), MinHash deduplication, Embedding-based diversity sampling, LoRA/QLoRA fine-tuning.
TL;DR: TRIM fixes the SFT token budget (≈800k tokens) to isolate data-selection effects, and shows on TinyLlama-1.1B-Chat (Alpaca + Dolly) that MinHash dedup and MinHash + embedding diversity selection perform on par with random selection (held-out PPL Δ < 0.2%, toxicity ≈ 0).
Abstract: Data filtering is often assumed to improve instruction tuning, but practitioners
rarely control for token budget, the most binding constraint in small-scale fine-tuning. We introduce TRIM (Token-budget-Regulated Instruction data Mining), a
simple pipeline that enforces an explicit token cap during data selection, and use
it to study: at a fixed training-token budget, do common filtering heuristics beat
random selection? Using an 800,000-token cap over a mixed instruction dataset
(Alpaca + Dolly), we compare (i) random selection, (ii) near-duplicate removal
via MinHash, and (iii) MinHash + embedding-based diversity selection. We fine-tune TinyLlama/TinyLlama-1.1B-Chat-v1.0 with QLoRA/LoRA-style
adapters and evaluate held-out perplexity plus a lightweight toxicity probe. In this
regime, both deduplication variants match but do not improve on random selection:
held-out PPL differs by < 0.2% and random is slightly best. Our takeaway is
deliberately modest: under tight token budgets and standard instruction corpora,
“obvious” deduplication/diversity steps may be lower-impact than expected unless
paired with stronger quality signals, larger budgets, or multi-seed evaluation.
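
To make the token-capped selection idea concrete, here is a minimal sketch of two of the compared arms: random selection under a token cap, and MinHash near-duplicate removal followed by the same capped selection. It is not the authors' implementation. The function names, the word 3-gram shingling, the 0.8 similarity threshold, the 64 hash permutations, and the whitespace token counter (a stand-in for the model tokenizer) are all assumptions made for illustration. The embedding-based diversity arm is omitted.

```python
import hashlib
import random
import re


def shingles(text, k=3):
    """Return the set of word k-grams in text (assumed shingling scheme)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(text, num_perm=64):
    """MinHash signature: the minimum salted 64-bit hash per 'permutation'."""
    grams = shingles(text)
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for g in grams
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def minhash_dedup(texts, threshold=0.8):
    """Greedy near-duplicate removal: keep a text only if it is not too similar
    to any text already kept. Quadratic in the number of texts; larger corpora
    would typically use LSH bucketing instead."""
    kept, kept_sigs = [], []
    for text in texts:
        sig = minhash_signature(text)
        if all(estimated_jaccard(sig, other) < threshold for other in kept_sigs):
            kept.append(text)
            kept_sigs.append(sig)
    return kept


def whitespace_token_count(text):
    """Hypothetical stand-in for the model tokenizer's token count."""
    return len(text.split())


def select_within_budget(texts, budget, count_tokens=whitespace_token_count, seed=0):
    """Shuffle, then add examples while the running token total stays within budget."""
    pool = list(texts)
    random.Random(seed).shuffle(pool)
    selected, total = [], 0
    for text in pool:
        n = count_tokens(text)
        if total + n <= budget:
            selected.append(text)
            total += n
    return selected, total


if __name__ == "__main__":
    corpus = [
        "Explain what a hash table is and how it works.",
        "Explain what a hash table is and how it works!",
        "Write a haiku about autumn leaves falling down.",
        "List three benefits of regular physical exercise.",
    ]
    random_arm, random_tokens = select_within_budget(corpus, budget=20)
    dedup_arm, dedup_tokens = select_within_budget(minhash_dedup(corpus), budget=20)
    print(len(random_arm), random_tokens)
    print(len(dedup_arm), dedup_tokens)
```

Because both arms pass through the same `select_within_budget` step, they train on at most the same number of tokens, so any difference in downstream metrics can be attributed to which examples were selected rather than to how much data was used.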
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 122