DATA-FM: TOKEN-BUDGETED DATA FILTERING FOR INSTRUCTION TUNING

Published: 02 Mar 2026, Last Modified: 02 Mar 2026 · ICLR 2026 Workshop DATA-FM · CC BY 4.0
Keywords: token-budgeted data selection, instruction tuning (SFT), MinHash deduplication, embedding-based diversity sampling, LoRA/QLoRA fine-tuning.
TL;DR: DATA-FM fixes the SFT token budget (≈800k) to isolate data-selection effects, and shows on TinyLlama-1.1B-Chat (Alpaca+Dolly) that MinHash dedup and MinHash+embedding diversity perform on par with random (held-out PPL Δ<0.2%, toxicity ≈0).
Abstract: Data filtering is often assumed to improve instruction tuning, but practitioners rarely control for token budget—the most binding constraint in small-scale finetuning. We study a simple question: at a fixed training-token budget, do common filtering heuristics beat random selection? Using an 800,000-token cap over a mixed instruction dataset (Alpaca + Dolly), we compare (i) random selection, (ii) near-duplicate removal via MinHash, and (iii) MinHash + embedding-based diversity selection. We finetune TinyLlama/TinyLlama-1.1B-Chat-v1.0 with QLoRA/LoRA-style adapters and evaluate held-out perplexity plus a lightweight toxicity probe. In this regime, both deduplication variants match but do not improve on random selection: held-out PPL differs by < 0.2%, and random is slightly best. Our takeaway is deliberately modest: under tight token budgets and standard instruction corpora, “obvious” deduplication/diversity steps may be lower-impact than expected unless paired with stronger quality signals, larger budgets, or multi-seed evaluation.
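To make the setup concrete, the token-budgeted MinHash condition in the abstract can be sketched as a single greedy pass: keep an example only if it fits the remaining budget and its MinHash signature is not too similar to anything already kept. This is an illustrative stdlib-only sketch, not the paper's code; the shingle size, 64-permutation signature, whitespace token counting, and 0.8 duplicate threshold are all assumptions.

```python
import hashlib

def shingles(text, k=5):
    # Character k-shingles as a simple proxy for word n-grams.
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}

def minhash(sh, num_perm=64):
    # Stdlib-only MinHash: salt one hash function per "permutation"
    # and record the minimum hashed shingle for each salt.
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16) for s in sh)
        for seed in range(num_perm)
    ]

def jaccard_est(sig_a, sig_b):
    # Fraction of matching signature slots estimates Jaccard similarity.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def select_under_budget(examples, token_budget, dup_threshold=0.8):
    # Greedy pass: skip examples that exceed the remaining budget or
    # near-duplicate an already-kept example (hypothetical threshold).
    kept, sigs, used = [], [], 0
    for ex in examples:
        n_tok = len(ex.split())  # stand-in for real tokenizer counts
        if used + n_tok > token_budget:
            continue
        sig = minhash(shingles(ex))
        if any(jaccard_est(sig, s) >= dup_threshold for s in sigs):
            continue  # near-duplicate of something already kept
        kept.append(ex)
        sigs.append(sig)
        used += n_tok
    return kept, used
```

In the paper's actual setting the budget would be the ≈800k-token cap and the counts would come from the TinyLlama tokenizer; the diversity variant would add an embedding-distance criterion on top of this dedup pass.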
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 122