TL;DR: We propose a fine-grained token cleaning pipeline for LLM instruction tuning to boost overall performance.
Abstract: Recent studies show that in supervised fine-tuning (SFT) of large language models (LLMs), data quality matters more than quantity.
While most data cleaning methods concentrate on filtering entire samples, the quality of individual tokens within a sample can vary significantly. After pre-training, even high-quality samples contain patterns or phrases that are unrelated to the target task and can be redundant, uninformative, or even harmful. Continuing to fine-tune on such patterns offers limited benefit and can even degrade downstream task performance.
In this paper, we investigate token quality from a noisy-label perspective and propose a generic token cleaning pipeline for SFT tasks. Our method filters out uninformative tokens while preserving those carrying key task-specific information. Specifically, we first evaluate token quality by examining the influence of model updates on each token, then apply a threshold-based separation. The token influence can be measured in a single pass with a fixed reference model or iteratively with self-evolving reference models. We analyze the benefits and limitations of both approaches theoretically via error upper bounds. Extensive experiments show that our framework consistently improves downstream performance. Code is available at https://github.com/UCSC-REAL/TokenCleaning.
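To make the pipeline concrete, below is a minimal sketch of the threshold-based separation step, assuming HuggingFace-style causal LMs. The helper names, the loss-difference influence score, and the quantile threshold are illustrative assumptions for exposition, not the paper's exact formulation; see the repository for the actual implementation.

```python
# Hypothetical sketch: score each token by how much the reference update
# changes its loss, then keep tokens above a quantile threshold.
import torch
import torch.nn.functional as F


def per_token_loss(model, input_ids):
    """Per-token cross-entropy loss for one sequence, shape (seq_len - 1,)."""
    with torch.no_grad():
        logits = model(input_ids.unsqueeze(0)).logits[0]  # (seq_len, vocab)
    # Token t is predicted from the prefix ending at t - 1.
    return F.cross_entropy(logits[:-1], input_ids[1:], reduction="none")


def token_cleaning_mask(base_model, ref_model, input_ids, quantile=0.5):
    """Boolean mask of tokens to keep in the SFT loss (True = keep).

    Influence here is the drop in per-token loss from the base model to the
    reference model; tokens whose loss drops the most are treated as carrying
    task-relevant signal (an assumption of this sketch).
    """
    loss_base = per_token_loss(base_model, input_ids)
    loss_ref = per_token_loss(ref_model, input_ids)
    influence = loss_base - loss_ref              # larger = more informative
    threshold = influence.quantile(quantile)      # threshold-based separation
    return influence > threshold
```

During fine-tuning, the SFT loss would then be averaged only over the kept tokens; the iterative variant would periodically refresh `ref_model` on the cleaned data and recompute the mask.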
Lay Summary: Large language models (LLMs) learn to write, summarize, or answer questions by training on vast amounts of text. But not all parts of this text are equally helpful — some words or phrases might be irrelevant or even misleading for the task at hand.
We asked: instead of filtering out whole examples during training, what if we look inside each example and clean out just the bad words? We created a method to do exactly that — it checks how much each word helps or hurts the model’s learning, and removes the unhelpful ones while keeping the important ones.
This process, called token cleaning, led to models that performed better across a wide range of tasks. It works with either a fixed reference model or one that evolves as training proceeds.
Link To Code: https://github.com/UCSC-REAL/TokenCleaning
Primary Area: Deep Learning->Large Language Models
Keywords: LLM, Supervised Fine-tuning, Data Selection, Token Cleaning
Submission Number: 5167