Abstract: Large Language Models have become the de facto approach to sequence-to-sequence text generation tasks, but for specialized tasks/domains, a pretrained LLM lacks specific capabilities to produce accurate or well-formatted responses. Supervised fine-tuning specializes a LLM by training it on dataset of example prompts with desired target responses, but real-world data tends to be noisy. While many fine-tuning algorithms exist, here we consider a \emph{data-centric AI} perspective on LLM fine-tuning, studying how to \emph{systematically} curate the training dataset to improve the LLM produced via \emph{any} fine-tuning algorithm.We introduce an automated data curation pipeline \modelnameA{} (\underline{\textbf{C}}onfidence-based \underline{\textbf{L}}LM \underline{\textbf{E}}valuation \underline{\textbf{A}}nd \underline{\textbf{R}}ectification) for instruction tuning datasets, that can be used with any LLM and fine-tuning procedure. CLEAR estimates which training data is low-quality and either filters or corrects it. Quality scoring and the decision to filter or correct are based on LLM-derived confidence estimates, which ensure algorithmic modifications to the dataset remain suitably conservative. Unlike existing data curation techniques, CLEAR is a comprehensive framework that can improve a dataset (and trained model outputs) without additional fine-tuning computations. Importantly, we don't assume access to a stronger LLM than the model being fine-tuned (e.g.\ relying on GPT-4 when fine-tuning GPT-3.5), to see whether CLEAR can meaningfully improve the capabilities of any LLM. Our experiments reveal that CLEAR consistently improves the performance of fine-tuned models across many datasets and models (like GPT-3.5 and Llama2).
Paper Type: long
Research Area: Generation
Languages Studied: English
0 Replies
Loading